Evidence-Guided Neural Architecture Selection under Uncertainty for Subject-Specific Blood Glucose Forecasting

arXiv cs.LG 06/05/26, 04:00 AM Papers
Summary
Proposes EVIDENT, a framework that integrates Bayesian training and evidence-based ranking for neural architecture selection, demonstrated on subject-specific blood glucose forecasting in type 1 diabetes, systematically selecting low-capacity models that generalize reliably.
arXiv:2606.05373v1 Announce Type: new
# Evidence-Guided Neural Architecture Selection under Uncertainty for Subject-Specific Blood Glucose Forecasting
Source: [https://arxiv.org/html/2606.05373](https://arxiv.org/html/2606.05373)
###### Abstract

Reliable neural architecture selection is an open challenge in time-series forecasting under limited, noisy, and heterogeneous data, where standard heuristic architecture design and validation approaches fail to ensure accurate and reliable prediction and generalization. We propose EVIDENT (EVidence-based IDEntification of Neural archiTectures), a framework for architecture selection that integrates Bayesian training, evidence-based ranking, and task-specific validation under uncertainty. The framework explores the candidate architecture pool and identifies the lowest-capacity model that satisfies a prescribed validation criterion. We demonstrate this method using temporal convolutional networks (TCNs) for individualized blood glucose forecasting in type 1 diabetes patients. The results show that EVIDENT systematically rejects both under- and over-parameterized TCN architectures on population-level diabetes data, while identifying models that generalize reliably to unseen patients. When multiple architectures are competitive, the framework further supports plausibility-weighted ensemble predictions that enhance predictive performance. Compared with a random-search baseline, EVIDENT identified smaller architectures with more consistent forecasting performance on unseen patients. These findings establish EVIDENT as a strategy for neural architecture discovery, enabling reliable model selection for high-consequence forecasting in data-limited and heterogeneous settings.

###### keywords:

Bayesian neural networks, Architecture selection, Model evidence, Time\-series modeling, Temporal convolutional networks

Journal: Neural Networks

Affiliations:

\[inst1\] Department of Mechanical and Aerospace Engineering, University at Buffalo, Buffalo, NY, USA

\[inst2\] Computational and Data-Enabled Sciences, University at Buffalo, Buffalo, NY, USA

## 1 Introduction

Neural network architecture and hyperparameter selection remains a central challenge in time-series learning \[[1](https://arxiv.org/html/2606.05373#bib.bib1),[2](https://arxiv.org/html/2606.05373#bib.bib2),[3](https://arxiv.org/html/2606.05373#bib.bib3)\], particularly in regimes with limited, noisy, and heterogeneous data. This challenge becomes more pronounced when the objective extends beyond pointwise accuracy to reliable forecasting for consequential decision making in physical and biomedical systems. In such settings, architecture design must balance temporal expressivity, effective parameter learning from available data, and generalization to unseen conditions. In practice, however, architecture choice often relies on trial-and-error tuning or heuristic selection based on held-out performance comparisons. While these approaches can yield accurate models for a given data split, they provide limited insight into whether the selected architecture’s trainable parameters are appropriately informed by the available data, how sensitive the architecture choice is to data partitioning, and whether the resulting predictions are sufficiently reliable for the intended task.

Subject-specific blood glucose forecasting in type 1 diabetic patients provides a high-consequence testbed for this problem. The goal is to predict future glucose trajectories from continuous glucose monitoring (CGM) data together with recorded meal intake and insulin delivery, enabling patient-specific treatment decisions such as optimizing the insulin profile. This setting is particularly challenging due to complex and delayed glucose-insulin dynamics, strong inter- and intra-patient variability, and wearable sensor noise. Consequently, architectures that perform well on retrospective historical patient data may still fail to generalize to prospective, subject-specific predictions, making this application a stringent benchmark for neural architecture selection.

Despite rapid progress in neural architectures and training algorithms, current approaches for constructing and validating neural network prediction remain limited in several key aspects\. Neural architecture search and related design strategies often rely on heuristic exploration or metaheuristic optimization\[[4](https://arxiv.org/html/2606.05373#bib.bib4),[5](https://arxiv.org/html/2606.05373#bib.bib5),[6](https://arxiv.org/html/2606.05373#bib.bib6),[7](https://arxiv.org/html/2606.05373#bib.bib7),[8](https://arxiv.org/html/2606.05373#bib.bib8)\], yielding architectures that are sensitive to data splits and may not generalize beyond the observed training distribution\. More fundamentally, model validation is typically driven by empirical performance assessment, often under large\-data assumptions, or by problem\-specific heuristics\[[9](https://arxiv.org/html/2606.05373#bib.bib9),[10](https://arxiv.org/html/2606.05373#bib.bib10),[11](https://arxiv.org/html/2606.05373#bib.bib11),[12](https://arxiv.org/html/2606.05373#bib.bib12),[13](https://arxiv.org/html/2606.05373#bib.bib13)\]\. These approaches provide limited insight into whether a given architecture is adequately supported by the available data or whether its predictive uncertainty is acceptable for the intended task\. Such limitations are particularly pronounced in biomedical and clinical time\-series forecasting, where inter\-subject variability and limited patient data can lead to unstable architecture rankings, and where average test\-set accuracy alone is insufficient to assess whether predictions are sufficiently accurate and reliable for the intended forecasting task\.

To address these limitations, we propose EVIDENT (EVidence-based IDEntification of Neural archiTectures), a framework for neural architecture selection that integrates Bayesian training \[[14](https://arxiv.org/html/2606.05373#bib.bib14),[15](https://arxiv.org/html/2606.05373#bib.bib15),[16](https://arxiv.org/html/2606.05373#bib.bib16),[17](https://arxiv.org/html/2606.05373#bib.bib17),[18](https://arxiv.org/html/2606.05373#bib.bib18)\], evidence-based ranking \[[19](https://arxiv.org/html/2606.05373#bib.bib19),[20](https://arxiv.org/html/2606.05373#bib.bib20)\], and task-specific validation under uncertainty \[[21](https://arxiv.org/html/2606.05373#bib.bib21),[22](https://arxiv.org/html/2606.05373#bib.bib22)\]. Instead of exhaustively searching large architecture spaces or selecting models solely based on held-out error, EVIDENT evaluates candidate architectures in a systematic level-wise manner and returns the lowest-capacity architecture, i.e., the one with the fewest trainable parameters, that satisfies a prescribed reliability criterion. The underlying idea is to decouple model ranking from model validation, such that Bayesian evidence is used to identify high-plausibility regions of the architecture space, while model acceptance is based on task-specific validation that explicitly accounts for predictive uncertainty. In this work, we instantiate EVIDENT using temporal convolutional networks (TCNs) for subject-specific blood glucose forecasting, and demonstrate that approximate evidence consistently localizes a narrow intermediate-capacity regime, within which validated predictors are identified.

The remainder of the paper is organized as follows. Section [2](https://arxiv.org/html/2606.05373#S2) develops the proposed methodology, including the glucose forecasting formulation, the feasible TCN architecture space, Bayesian learning, approximate evidence-based architecture ranking of TCN predictors, and the EVIDENT algorithmic framework. Section [3](https://arxiv.org/html/2606.05373#S3) presents the numerical results, including a set of plausibility-guided analyses that examine the behavior of TCN architectures, and then applies EVIDENT to type 1 diabetes population-level architecture selection under patient-specific validation. Section [4](https://arxiv.org/html/2606.05373#S4) discusses the broader implications, limitations, and extensions of the approach, and Section [5](https://arxiv.org/html/2606.05373#S5) provides concluding remarks.

## 2 Methods

### 2.1 Glucose forecasting and feasible architecture space

We consider multi-step blood-glucose (BG) forecasting from multichannel time series consisting of glucose, meal intake, and insulin delivery sequences. At discrete time index $k$, the model maps a window of recent observations to a future glucose trajectory over a prediction horizon of length $H$. Let $G^{(k-d_{G}:k)}\in\mathbb{R}^{d_{G}}$, $M^{(k-d_{M}:k)}\in\mathbb{R}^{d_{M}}$, and $I^{(k-d_{I}:k)}\in\mathbb{R}^{d_{I}}$ denote the corresponding input histories of glucose, meal intake, and insulin delivery, respectively. The forecasting model is posed as a nonlinear operator $\mathcal{T}_{w,\xi}:\mathbb{R}^{d_{G}}\times\mathbb{R}^{d_{M}}\times\mathbb{R}^{d_{I}}\rightarrow\mathbb{R}^{H}$, where $\xi$ denotes the architecture parameters and $w\in\mathbb{R}^{W}$ denotes all trainable parameters. The predictor is then written as

$$G^{(k+1:k+H)}=\mathcal{T}_{w,\xi}\!\left(G^{(k-d_{G}:k)},\,M^{(k-d_{M}:k)},\,I^{(k-d_{I}:k)}\right), \tag{1}$$

where $G^{(k+1:k+H)}\in\mathbb{R}^{H}$ denotes the predicted future BG trajectory. In this work, $\mathcal{T}_{w,\xi}$ is implemented using an encoder–decoder temporal convolutional network (TCN) \[[23](https://arxiv.org/html/2606.05373#bib.bib23),[24](https://arxiv.org/html/2606.05373#bib.bib24)\], as illustrated in Figure [1](https://arxiv.org/html/2606.05373#S2.F1). The input is represented as a multichannel sequence with three channels corresponding to $G^{(k-d_{G}:k)}$, $M^{(k-d_{M}:k)}$, and $I^{(k-d_{I}:k)}$. The encoder consists of $N$ stacked temporal blocks. Each encoder block $j$ maps an input feature tensor to an output feature tensor with $c_{j}$ channels and is implemented as a temporal residual block composed of two Bayesian dilated causal 1D convolutional layers (Conv1D) with filter size $f$, followed by ReLU nonlinearities and a residual skip connection. When the input and output channel dimensions differ, a point-wise $1\times 1$ Conv1D is applied on the skip path to linearly project the input channels before element-wise addition. This residual-block structure is inspired by ResNets \[[25](https://arxiv.org/html/2606.05373#bib.bib25)\] and improves training of deeper TCNs by facilitating gradient flow across blocks. The convolutional layers use causal padding so that the temporal dimension is preserved within each block. The dilation factor in encoder block $j$ is defined as $\delta_{j}=b^{j-1}$, where $b$ is the dilation base.
Dilation determines the spacing between adjacent filter taps: when $\delta_{j}=1$, the convolution uses consecutive time samples, while for $\delta_{j}>1$, the filter skips intermediate samples and therefore covers a wider temporal range without increasing the filter size. The geometric growth of $\delta_{j}$ with depth $j$ enlarges the receptive field and enables the network to capture long-range temporal dependencies while preserving causality. Following each encoder block, except the last, an average pooling layer (AvgPool) reduces the temporal resolution, yielding a compact latent representation. The decoder maps the latent sequence back to the forecast resolution through $N$ decoder blocks with channel counts $\tilde{c}_{1},\ldots,\tilde{c}_{N}$. Each decoder block follows the same residual structure, consisting of two causal Conv1D layers with filter size $f$, ReLU activations, and a residual connection with a $1\times 1$ projection when the block’s input and the second convolution’s output have different channel dimensions, so that the two tensors can be summed element-wise in the residual connection. In contrast to the encoder, the decoder convolutions use unit dilation (i.e., $\delta_{j}=1$ for all decoder blocks $j=1,\ldots,N$). Between successive decoder blocks, upsampling layers increase the temporal resolution. Finally, a Bayesian Conv1D layer maps the decoder feature tensor to the predicted BG trajectory $G^{(k+1:k+H)}$.
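
To make the role of dilation concrete, the following minimal NumPy sketch implements a single-channel causal dilated convolution and the encoder dilation schedule $\delta_{j}=b^{j-1}$. It is an illustration only, not the authors' Bayesian Conv1D layer: the function names, the deterministic weights, and the zero left-padding are assumptions made for this example.

```python
import numpy as np

def dilation_schedule(num_blocks, base):
    """Encoder dilations delta_j = base**(j-1) for j = 1..num_blocks."""
    return [base ** (j - 1) for j in range(1, num_blocks + 1)]

def causal_dilated_conv1d(x, w, dilation):
    """y[t] = sum_i w[i] * x[t - i*dilation], with zeros assumed before t = 0.

    Left-padding by (f-1)*dilation keeps the output length equal to the
    input length and guarantees that y[t] never depends on x[t+1:].
    """
    x = np.asarray(x, dtype=float)
    f = len(w)
    pad = (f - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for i in range(f):
        start = pad - i * dilation  # tap i looks i*dilation samples into the past
        y += w[i] * xp[start:start + len(x)]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = causal_dilated_conv1d(x, w=[1.0, 1.0], dilation=2)  # y[t] = x[t] + x[t-2]
```

With `dilation=2` and two taps, each output combines the current sample with the one two steps earlier, so the filter spans three time steps while still using only two weights.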

![Refer to caption](https://arxiv.org/html/2606.05373v1/x1.png)

Figure 1: (Left panel) Bayesian temporal convolutional network (TCN) architecture used for multi-step blood-glucose (BG) forecasting, $G^{(k+1:k+H)}=\mathcal{T}_{w,\xi}\!\left(G^{(k-d_{G}:k)},\,M^{(k-d_{M}:k)},\,I^{(k-d_{I}:k)}\right)$. The encoder contains $N$ temporal residual blocks with $c_{j}$ channels, filter size $f$, and dilation $\delta_{j}=b^{j-1}$; average-pooling layers between successive encoder blocks progressively reduce the temporal resolution and produce an encoded latent sequence. The decoder consists of $N$ residual convolutional blocks with $\tilde{c}_{j}$ channels, unit-dilation causal convolutions, and upsampling layers between successive decoder blocks. (Top-right panel) Temporal residual block consisting of two Bayesian causal Conv1D layers with ReLU activations and a residual skip connection; a $1\times 1$ Conv1D is used on the skip path when the input and output channel dimensions differ. (Bottom-right panel) The Bayesian parameterization of a causal Conv1D layer, where the convolutional weights are represented by a probability distribution determined by Bayesian training.

Within this architecture family, the receptive field is a derived quantity determined by $\xi$, indicating the set of input entries that affect a specific entry of the output. For the temporal blocks in Figure [1](https://arxiv.org/html/2606.05373#S2.F1), each of which contains two causal dilated convolutions, the resulting receptive field is

$$\mathrm{RF}=1+2(f-1)\sum_{j=0}^{N-1}b^{j}=1+2(f-1)\frac{b^{N}-1}{b-1}. \tag{2}$$

The feasible architecture space is defined by the finite combinations of discrete architecture parameters (number of blocks $N$, dilation base $b$, filter size $f$, number of channels in the encoder $\{c_{j}\}_{j=1}^{N}$ and decoder $\{\tilde{c}_{j}\}_{j=1}^{N}$, stride $s$, input lengths $\{d_{G},d_{M},d_{I}\}$, and output horizon $H$), subject to structural and temporal constraints,

$$\Xi=\left\{\xi:\mathrm{RF}\geq\tau_{\min},\;f\geq b\right\}, \tag{3}$$

where $\tau_{\min}$ denotes the minimum temporal extent required to capture relevant glucose dynamics (e.g., delayed effects of meals and insulin), and $f\geq b$ enforces a no-holes condition for dilated convolutions such that every input in the sequence contributes to the output. Additional problem-specific constraints may be imposed based on the desired prediction horizon and computational budget. Despite these restrictions, the combinatorial structure of $\xi$ induces a large candidate architecture space, motivating the need for a principled architecture discovery strategy.
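
The receptive-field formula in Eq. (2) and the feasibility constraints in Eq. (3) can be checked with a few lines of Python. The sketch below is a hedged illustration: the function names and the small parameter grid are hypothetical, and only the three parameters $(N, b, f)$ that enter the two constraints are enumerated.

```python
from itertools import product

def receptive_field(num_blocks, base, filter_size):
    """RF = 1 + 2(f-1) * sum_{j=0}^{N-1} b^j, as in Eq. (2).

    The sum is evaluated explicitly so that base = 1 is also handled,
    where the closed form (b^N - 1)/(b - 1) is undefined.
    """
    return 1 + 2 * (filter_size - 1) * sum(base ** j for j in range(num_blocks))

def is_feasible(num_blocks, base, filter_size, tau_min):
    """Eq. (3): receptive field covers tau_min and the no-holes condition f >= b holds."""
    return receptive_field(num_blocks, base, filter_size) >= tau_min and filter_size >= base

def enumerate_feasible(blocks, bases, filters, tau_min):
    """Return every (N, b, f) combination from the grid that satisfies Eq. (3)."""
    return [(n, b, f) for n, b, f in product(blocks, bases, filters)
            if is_feasible(n, b, f, tau_min)]

# Hypothetical grid and tau_min, chosen only for illustration.
feasible = enumerate_feasible(blocks=(1, 2, 3), bases=(2, 3), filters=(2, 3), tau_min=12)
```

For example, $(N, b, f) = (3, 3, 2)$ has a receptive field of 27 samples but is rejected because $f < b$ leaves gaps in the dilated filters.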

### 2.2 Bayesian learning of TCN predictors

We pose the training of the TCN predictor in a Bayesian setting by placing a probability distribution over the trainable parameters, i.e., the convolutional weights $w$. The Bayesian training is motivated by the limited and heterogeneous nature of the data, the need for quantifying uncertainty in the forecasts, and the requirement that architecture selection be robust to overconfident data fits. In our particular problem, training is performed using population-level data obtained from multichannel time series across multiple type 1 diabetes patients. The TCN is trained using a sliding-window procedure with stride $s$, applied over the full dataset. The retrospective data associated with patient $i$ is written as

$$\mathcal{D}_{i}=\left\{\left(G_{\mathrm{data},i}^{(k-d_{G}:k)},\,M_{\mathrm{data},i}^{(k-d_{M}:k)},\,I_{\mathrm{data},i}^{(k-d_{I}:k)},\,G_{\mathrm{data},i}^{(k+1:k+H)}\right)\right\}_{k=1}^{N_{D_{i}}}. \tag{4}$$

Here, $G_{\mathrm{data},i}$ denotes the observed glucose trajectory from continuous glucose monitoring (CGM), $M_{\mathrm{data},i}$ denotes the recorded meal-related input, and $I_{\mathrm{data},i}$ denotes the recorded insulin delivery. Each index $k$ corresponds to a sliding temporal window extracted from the trajectory of patient $i$, and $N_{D_{i}}$ denotes the total number of admissible input-output windows obtained from that patient. The complete population-based training dataset constructed from $N_{\mathrm{patient}}$ patients is then given by

$$\mathcal{D}=\bigcup_{i=1}^{N_{\mathrm{patient}}}\mathcal{D}_{i}. \tag{5}$$

For a given architecture $\xi$, Bayesian training is defined by a prior distribution on the weights and a likelihood function for the observed trajectories. We use a Gaussian prior $\pi_{\mathrm{pr}}(w\mid\sigma_{\mathrm{pr}},\xi)=\mathcal{N}\!\left(0,\sigma_{\mathrm{pr}}^{2}I_{W}\right)$ and assume an additive Gaussian noise model with variance $\sigma_{\mathrm{noise}}^{2}$, leading to a log-likelihood proportional to

$$\log\pi_{\mathrm{like}}(\mathcal{D}\mid w,\sigma_{\mathrm{noise}},\xi)\;\propto\;-\frac{1}{2\sigma_{\mathrm{noise}}^{2}}\sum_{k=1}^{N_{D}}\left\|G_{\mathrm{data}}^{(k+1:k+H)}-\mathcal{T}_{w,\xi}\!\big(G^{(k-d_{G}:k)},M^{(k-d_{M}:k)},I^{(k-d_{I}:k)}\big)\right\|_{2}^{2}, \tag{6}$$

where $\sigma=(\sigma_{\mathrm{pr}},\sigma_{\mathrm{noise}})$ are referred to as the inference hyperparameters and $N_{D}$ denotes the number of training input–output windows over all patients. Under these assumptions, the posterior distribution over the trainable parameters is given by

$$\pi_{\mathrm{post}}(w\mid\mathcal{D},\sigma,\xi)=\frac{\pi_{\mathrm{like}}(\mathcal{D}\mid w,\sigma_{\mathrm{noise}},\xi)\,\pi_{\mathrm{pr}}(w\mid\sigma_{\mathrm{pr}},\xi)}{\pi_{\mathrm{evid}}(\mathcal{D}\mid\sigma_{\mathrm{pr}},\sigma_{\mathrm{noise}},\xi)}. \tag{7}$$

Sampling-based Bayesian inference is intractable for TCNs with high-dimensional parameters. We thus employ variational inference with a tractable variational distribution $q_{\eta}(w)$, parameterized by $\eta$, chosen here as a mean-field Gaussian. The optimal variational distribution is obtained by minimizing the Kullback–Leibler divergence to the true posterior, or equivalently by maximizing the evidence lower bound (ELBO), i.e., minimizing the negative ELBO:

$$\eta^{\mathrm{opt}}=\arg\min_{\eta}\left[\mathrm{KL}\!\left(q_{\eta}(w)\,\|\,\pi_{\mathrm{pr}}(w\mid\sigma_{\mathrm{pr}},\xi)\right)-\mathbb{E}_{q_{\eta}(w)}\left[\log\pi_{\mathrm{like}}(\mathcal{D}\mid w,\sigma_{\mathrm{noise}},\xi)\right]\right]. \tag{8}$$

This yields an approximate posterior over the TCN weight parameters. Given the posterior distribution over $w$ and fixed $\sigma$ and $\xi$, the predictive distribution for a new input sequence is

$$
\begin{aligned}
&\pi\!\left(G^{*(k+1:k+H)}\,\middle|\,G^{(k-d_{G}:k)},\,M^{(k-d_{M}:k)},\,I^{(k-d_{I}:k)},\,\mathcal{D},\,\sigma,\,\xi\right)\\
&\quad=\int\pi\!\left(G^{*(k+1:k+H)}\,\middle|\,G^{(k-d_{G}:k)},\,M^{(k-d_{M}:k)},\,I^{(k-d_{I}:k)},\,w,\,\sigma,\,\xi\right)\pi_{\mathrm{post}}(w\mid\mathcal{D},\sigma,\xi)\,dw,
\end{aligned}
\tag{9}
$$

which integrates the TCN predictions over the learned weight posterior. In practice, this distribution is approximated via Monte Carlo sampling from $q_{\eta^{\mathrm{opt}}}(w)$, yielding predictive mean trajectories and corresponding uncertainty.
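
Two ingredients of this subsection have simple closed or sampled forms that are easy to sketch. Assuming a mean-field Gaussian $q_{\eta}(w)=\mathcal{N}(\mu,\mathrm{diag}(s^{2}))$ and the isotropic prior above, the KL term in Eq. (8) is available analytically, and the predictive integral in Eq. (9) can be approximated by sampling weights. The sketch below is illustrative only: the function names are hypothetical, and `forward` stands in for any deterministic network evaluation (here a toy linear map rather than a TCN).

```python
import numpy as np

def kl_meanfield_to_prior(mu, s, sigma_pr):
    """Closed-form KL( N(mu, diag(s^2)) || N(0, sigma_pr^2 I) )."""
    mu, s = np.asarray(mu, dtype=float), np.asarray(s, dtype=float)
    ratio = (s / sigma_pr) ** 2
    return 0.5 * np.sum(ratio + (mu / sigma_pr) ** 2 - 1.0 - np.log(ratio))

def mc_predictive(forward, x, mu, s, n_samples=200, seed=0):
    """Monte Carlo estimate of the predictive mean and variance.

    Draws w ~ N(mu, diag(s^2)), evaluates forward(x, w) for each draw,
    and returns the element-wise sample mean and variance.
    """
    rng = np.random.default_rng(seed)
    mu, s = np.asarray(mu, dtype=float), np.asarray(s, dtype=float)
    draws = np.stack([forward(x, mu + s * rng.standard_normal(mu.shape))
                      for _ in range(n_samples)])
    return draws.mean(axis=0), draws.var(axis=0)

# Toy stand-in for the network: a linear map from 3 weights to a horizon of 2.
toy_forward = lambda x, w: x @ w
x_toy = np.array([[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]])
mean, var = mc_predictive(toy_forward, x_toy, mu=[1.0, -1.0, 0.5], s=[0.1, 0.1, 0.1])
```

The variance returned here reflects weight uncertainty only; the observation noise $\sigma_{\mathrm{noise}}^{2}$ would be added separately to obtain the full predictive variance.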

### 2.3 Evidence as an architecture selection criterion

To move from single-model Bayesian training to architecture discovery, we use the evidence in ([7](https://arxiv.org/html/2606.05373#S2.E7)) as the scoring function for selecting the inference hyperparameters $\sigma=(\sigma_{\mathrm{pr}},\sigma_{\mathrm{noise}})$ and for comparing candidate architectures $\xi$. The evidence provides a principled trade-off between data fit and learned parameter uncertainty \[[16](https://arxiv.org/html/2606.05373#bib.bib16),[21](https://arxiv.org/html/2606.05373#bib.bib21),[19](https://arxiv.org/html/2606.05373#bib.bib19)\], and is thereby expected to favor models with robust predictive performance based on the available data. For fixed $\sigma$ and $\xi$, the evidence is defined as

$$\pi_{\mathrm{evid}}(\mathcal{D}\mid\sigma,\xi)=\int\pi_{\mathrm{like}}(\mathcal{D}\mid w,\sigma_{\mathrm{noise}},\xi)\,\pi_{\mathrm{pr}}(w\mid\sigma_{\mathrm{pr}},\xi)\,dw.$$

This quantity evaluates how well the architecture $\xi$, together with $\sigma$, explains the training data after marginalizing over parameter uncertainty. The posterior over the inference hyperparameters is given by

$$\pi_{\mathrm{post}}(\sigma\mid\mathcal{D},\xi)=\frac{\pi_{\mathrm{evid}}(\mathcal{D}\mid\sigma,\xi)\,\pi_{\mathrm{pr}}(\sigma\mid\xi)}{\pi_{\mathrm{evid}}(\mathcal{D}\mid\xi)}. \tag{10}$$

In this work, ([10](https://arxiv.org/html/2606.05373#S2.E10)) is approximated using a Laplace approximation, yielding the maximum a posteriori estimate $\sigma_{\mathrm{MAP}}$. Details of this approximation are provided in Appendix [A](https://arxiv.org/html/2606.05373#A1) and follow \[[21](https://arxiv.org/html/2606.05373#bib.bib21)\]. Given $\sigma_{\mathrm{MAP}}$, the architecture-level evidence is obtained by marginalizing over $\sigma$,

$$\pi_{\mathrm{evid}}(\mathcal{D}\mid\xi)=\int\pi_{\mathrm{evid}}(\mathcal{D}\mid\sigma,\xi)\,\pi_{\mathrm{pr}}(\sigma\mid\xi)\,d\sigma,$$

which is again approximated via a Laplace approximation (see Appendix [A](https://arxiv.org/html/2606.05373#A1)). Let $\mathcal{M}=\{\xi_{m}\}_{m=1}^{M}$ denote the finite candidate architecture set. The posterior plausibility of architecture $\xi_{m}$ is defined as

$$\rho_{m}=\pi_{\mathrm{post}}(\xi_{m}\mid\mathcal{D})=\frac{\pi_{\mathrm{evid}}(\mathcal{D}\mid\xi_{m})\,\pi_{\mathrm{pr}}(\xi_{m})}{\sum_{j=1}^{M}\pi_{\mathrm{evid}}(\mathcal{D}\mid\xi_{j})\,\pi_{\mathrm{pr}}(\xi_{j})},\qquad m=1,\ldots,M. \tag{11}$$

Under a uniform prior over architectures, $\rho_{m}$ reduces to normalized evidence. The plausibility weights $\{\rho_{m}\}_{m=1}^{M}$ can be used either to select the most plausible architecture $\xi_{\mathrm{plausible}}=\arg\max_{\xi\in\Xi}\rho(\xi)$ or to form a plausibility-weighted ensemble predictor:

$$
\begin{aligned}
&\pi\!\left(G^{*(k+1:k+H)}\,\middle|\,G^{(k-d:k)},\,M^{(k-d:k)},\,I^{(k-d:k)},\,\mathcal{D}\right)\\
&\quad=\sum_{m=1}^{M}\rho_{m}\,\pi\!\left(G^{*(k+1:k+H)}\,\middle|\,G^{(k-d:k)},\,M^{(k-d:k)},\,I^{(k-d:k)},\,\mathcal{D},\,\xi_{m}\right).
\end{aligned}
\tag{12}
$$
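
In practice the evidence values in Eq. (11) are only available on a log scale and can differ by hundreds of nats, so the normalization is best done with a log-sum-exp shift. The sketch below shows this, together with the first two moments of the mixture in Eq. (12). It assumes each architecture's predictive distribution is summarized by a Gaussian mean and variance, which is a simplification introduced for this example; the function names are hypothetical.

```python
import numpy as np

def plausibility_weights(log_evidence, log_prior=None):
    """Posterior plausibilities rho_m from log-evidences, as in Eq. (11).

    A missing log_prior is treated as a uniform prior over architectures.
    Subtracting the maximum before exponentiating avoids underflow.
    """
    log_evidence = np.asarray(log_evidence, dtype=float)
    if log_prior is None:
        log_prior = np.zeros_like(log_evidence)
    z = log_evidence + np.asarray(log_prior, dtype=float)
    w = np.exp(z - z.max())
    return w / w.sum()

def ensemble_moments(rho, means, variances):
    """Mean and variance of the mixture sum_m rho_m N(mean_m, var_m).

    means and variances have shape (M, H): one forecast per architecture.
    """
    rho = np.asarray(rho, dtype=float)[:, None]
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mix_mean = (rho * means).sum(axis=0)
    mix_var = (rho * (variances + means ** 2)).sum(axis=0) - mix_mean ** 2
    return mix_mean, mix_var

rho = plausibility_weights([-1000.0, -1000.0 + np.log(3.0)])
mix_mean, mix_var = ensemble_moments([0.5, 0.5], means=[[0.0], [2.0]], variances=[[1.0], [1.0]])
```

The mixture variance contains both the within-model variances and the spread between model means, so disagreement among competitive architectures widens the ensemble's predictive uncertainty.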

### 2.4 EVIDENT: Evidence-based identification of neural architectures

EVIDENT is an iterative architecture discovery strategy that combines Bayesian training, evidence-based ranking, and validation under uncertainty to identify TCN predictors with accurate and reliable task-specific performance. Rather than exhaustively searching a potentially large candidate pool, EVIDENT proceeds level by level, determining high-plausibility regions of the architecture space and testing whether the resulting models satisfy the prescribed validation criteria. In this way, the framework balances predictive performance and model complexity while avoiding unnecessary exploration of architectures that are either insufficiently expressive or overly parameterized. An algorithmic summary of EVIDENT is provided in Figure [2](https://arxiv.org/html/2606.05373#S2.F2), and its components are described below.

##### Candidate pool and capacity levels

Let $\mathcal{M}_{\mathrm{initial}}=\{\xi_{m}\}_{m=1}^{M}$ denote the finite, but potentially large, candidate architecture set induced by the feasible architecture space. EVIDENT partitions this set into ordered capacity levels,

$$\mathcal{M}_{\mathrm{initial}}=\bigcup_{l=1}^{L}\mathcal{M}^{(l)},\qquad\mathcal{M}^{(l)}\cap\mathcal{M}^{(r)}=\varnothing\ \text{for}\ l\neq r,$$

where the levels are arranged from lower to higher model capacity according to a prescribed complexity measure, such as the number of trainable parameters. This construction avoids committing a priori to either overly small architectures, which may lack sufficient expressive power, or highly over-parameterized architectures, whose parameters cannot be reliably informed by the available data. Instead, EVIDENT searches progressively over increasingly expressive architecture classes guided by posterior plausibility.
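
One simple way to build such a partition is to sort the candidates by a capacity measure and cut the sorted list into contiguous groups. The sketch below does this with near-equal group sizes. The equal-size rule, the function name, and the toy candidate tuples are assumptions made for illustration; the text above only requires that the levels be disjoint and ordered by capacity.

```python
def partition_into_levels(candidates, capacity, num_levels):
    """Sort candidates by capacity and split them into ordered, disjoint levels.

    capacity is a callable returning the complexity measure of a candidate,
    for example its number of trainable parameters. Earlier levels receive
    one extra candidate when the pool does not divide evenly.
    """
    ordered = sorted(candidates, key=capacity)
    size, extra = divmod(len(ordered), num_levels)
    levels, start = [], 0
    for l in range(num_levels):
        stop = start + size + (1 if l < extra else 0)
        levels.append(ordered[start:stop])
        start = stop
    return levels

# Hypothetical pool: (name, number of trainable parameters).
pool = [("a", 900), ("b", 100), ("c", 400), ("d", 250), ("e", 700)]
levels = partition_into_levels(pool, capacity=lambda c: c[1], num_levels=2)
```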

##### Level\-wise architecture ranking

For each level $l$, every candidate architecture $\xi\in\mathcal{M}^{(l)}$ is trained using the Bayesian learning procedure in ([7](https://arxiv.org/html/2606.05373#S2.E7)). The corresponding posterior plausibility values $\rho(\xi)$ are then computed from ([11](https://arxiv.org/html/2606.05373#S2.E11)), and the architectures are ranked accordingly. Let $\widehat{\mathcal{M}}^{(l)}\subseteq\mathcal{M}^{(l)}$ denote the resulting shortlist of top-ranked architectures. In the simplest case, $\widehat{\mathcal{M}}^{(l)}$ contains only the most plausible architecture in level $l$; more generally, it may include several architectures with comparable evidence. Only the architectures in $\widehat{\mathcal{M}}^{(l)}$ are then advanced to the validation process.

EVIDENT: EVidence-based IDEntification of Neural archiTectures

Input: Initial architecture pool $\mathcal{M}_{\mathrm{initial}}=\{\xi_{m}\}_{m=1}^{M}$, training data $\mathcal{D}$, validation data $\mathcal{D}_{\mathrm{val}}$, tolerance $Tol$.

Output: Architectures identified as trustworthy predictors.

Procedure:

1. Partition $\mathcal{M}_{\mathrm{initial}}$ into levels $\{\mathcal{M}^{(l)}\}_{l=1}^{L}$.
2. For $l=1,\dots,L$:
   - (a) Train $\xi\in\mathcal{M}^{(l)}$ on $\mathcal{D}$ via Bayesian inference.
   - (b) Rank by plausibility $\rho(\xi)$ and retain top-ranked in $\widehat{\mathcal{M}}^{(l)}\subseteq\mathcal{M}^{(l)}$.
   - (c) Validate $\xi\in\widehat{\mathcal{M}}^{(l)}$ on $\mathcal{D}_{\mathrm{val}}$: $\mathbb{d}\!\left(\mathcal{D}_{\mathrm{val}},\,\mathcal{T}_{\xi}\right)\leq Tol$.
   - (d) Return accepted architecture(s) as trustworthy predictor(s) and terminate.
3. If none accepted, expand $\mathcal{M}_{\mathrm{initial}}$ and repeat.

Figure 2: EVIDENT workflow. Architectures are explored from lower to higher capacity, ranked by evidence, and validated using a task-specific probabilistic metric $\mathbb{d}$ (e.g., NLPD). The procedure returns the simplest architecture(s) satisfying the validation criterion, yielding trustworthy predictors.
##### Task\-specific validation under uncertainty

This stage is the central step in EVIDENT, as it is not intended to measure prediction accuracy on an arbitrary held-out dataset; rather, it tests whether an architecture delivers sufficiently accurate and reliable predictions for the intended deployment regime. Accordingly, the validation design must be problem-specific and aligned with the target use case. In the diabetes application considered in the next section, validation is designed to identify TCN architectures that generalize from population-level training data to unseen individual patients and remain trustworthy for insulin-control decisions. Validation is performed across holdout folds. In each fold, the posterior predictive distribution of each competitive architecture is compared against the observed trajectories of the held-out individual patients. Acceptance is determined using one or more probabilistic validation criteria together with a prescribed tolerance $Tol$. In this work, the primary validation metric is the negative log predictive density (NLPD),

$$
\mathrm{NLPD}=\frac{1}{N_{\mathrm{val}}}\sum_{i=1}^{N_{\mathrm{val}}}\left[\frac{\left\|G_{\mathrm{data},i}^{(k+1:k+H)}-\overline{\mathcal{T}}_{w,\xi}\!\left(G_{\mathrm{data},i}^{(k-d_{G}:k)},M_{\mathrm{data},i}^{(k-d_{M}:k)},I_{\mathrm{data},i}^{(k-d_{I}:k)}\right)\right\|_{2}^{2}}{2\sigma_{i}^{2}}+\frac{H}{2}\log\sigma_{i}^{2}\right],\tag{13}
$$

where $N_{\mathrm{val}}$ is the number of validation windows in the holdout fold, $\overline{\mathcal{T}}_{w,\xi}$ denotes the posterior predictive mean of the TCN forecast, and

$$
\sigma_{i}^{2}=\sigma_{\mathrm{data}}^{2}+\mathrm{Var}\!\left[\mathcal{T}_{w,\xi}\!\left(G_{\mathrm{data},i}^{(k-d_{G}:k)},M_{\mathrm{data},i}^{(k-d_{M}:k)},I_{\mathrm{data},i}^{(k-d_{I}:k)}\right)\right].\tag{14}
$$

Here, $\sigma_{\mathrm{data}}^{2}$ represents the intrinsic uncertainty of the glucose measurements, such as CGM sensor noise, while the second term captures the predictive variance induced by the Bayesian TCN. NLPD penalizes both inaccurate forecasts and unreliable predictive uncertainty, making it a natural criterion for assessing probabilistic forecasts. A candidate architecture is accepted as a “trustworthy predictor” if it satisfies a prescribed tolerance, i.e., $\mathrm{NLPD}\leq Tol$, possibly together with additional task-dependent validation requirements.
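As an illustration of Eqs. (13) and (14), the following minimal sketch computes the per-window NLPD term and its average over a fold. The function names and the toy numbers are hypothetical, and the predictive variance is assumed here to be a single scalar per window (for example, an average of the per-step posterior variances); the paper does not specify this reduction in this section.

```python
import numpy as np

def window_nlpd(y_true, pred_mean, pred_var, sigma_data2):
    """Per-window term inside the sum of Eq. (13).

    y_true, pred_mean: arrays of length H (observed and mean forecast glucose).
    pred_var: scalar posterior predictive variance for this window (assumption).
    sigma_data2: measurement-noise variance, sigma_data^2 in Eq. (14).
    """
    y_true = np.asarray(y_true, dtype=float)
    pred_mean = np.asarray(pred_mean, dtype=float)
    horizon = y_true.shape[0]
    sigma2 = sigma_data2 + float(pred_var)           # Eq. (14)
    sq_err = np.sum((y_true - pred_mean) ** 2)       # squared 2-norm of the error
    return float(sq_err / (2.0 * sigma2) + 0.5 * horizon * np.log(sigma2))

def fold_nlpd(windows, sigma_data2):
    """Average of the per-window terms over all validation windows, Eq. (13)."""
    terms = [window_nlpd(y, m, v, sigma_data2) for (y, m, v) in windows]
    return float(np.mean(terms)), terms

# Toy usage with made-up numbers (not data from the paper).
windows = [([0.0, 0.0], [1.0, 1.0], 1.0), ([1.0, 2.0], [1.0, 2.0], 0.0)]
mean_nlpd, per_window = fold_nlpd(windows, sigma_data2=1.0)
accepted = all(term <= 14.5 for term in per_window)  # per-window tolerance check
```

Keeping the per-window terms alongside the fold average makes it possible to apply the tolerance either to the average or to every window individually.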

##### Iterative search and stopping criteria

If at least one candidate in level $l$ satisfies the validation criteria, the search terminates and EVIDENT returns the corresponding architecture(s) as trustworthy predictor(s). Otherwise, the algorithm advances to the next level, $l\leftarrow l+1$, and repeats the cycle of training, architecture ranking, and validation under uncertainty. EVIDENT therefore discovers the lowest-capacity architecture that satisfies the fit-for-purpose validation requirements under predictive uncertainty. If no candidate passes validation in the final level, the candidate architecture set must be expanded. The resulting workflow favors plausible architectures that are supported by the training data and validated for the intended predictive task.
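The level-wise search and stopping rule can be sketched as a short loop. This is a schematic under stated assumptions, not the authors' implementation: `train`, `log_evidence`, and `validate` are hypothetical user-supplied callables, and `top_k` is an illustrative choice for how many top-ranked architectures are retained per level.

```python
def evident_search(levels, train, log_evidence, validate, tol, top_k=3):
    """Return the accepted architectures from the lowest level that passes.

    levels: list of lists of architectures, ordered from low to high capacity.
    train(arch) -> trained model (e.g., a variational posterior).
    log_evidence(model) -> float used to rank architectures within a level.
    validate(model) -> validation metric (e.g., NLPD); lower is better.
    """
    for level in levels:
        trained = [(arch, train(arch)) for arch in level]
        # Rank within the level; ordering by log-evidence matches ordering
        # by normalized plausibility when the prior over architectures is uniform.
        ranked = sorted(trained, key=lambda pair: log_evidence(pair[1]), reverse=True)
        accepted = [arch for arch, model in ranked[:top_k] if validate(model) <= tol]
        if accepted:
            return accepted          # stop at the first level with an accepted model
    return []                        # nothing accepted: the pool must be expanded
```

An empty return value corresponds to the case in which the candidate architecture set has to be expanded.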

## 3 Results

This section evaluates the Bayesian TCN and the proposed EVIDENT framework through two numerical studies. We first consider a single-patient setting to analyze how evidence varies across the TCN architecture space and whether it identifies architectures that balance expressivity and reliable forecasting. We then apply EVIDENT to population-based type 1 diabetes data from in silico patients to perform architecture discovery under inter-patient variability, with the goal of identifying TCN predictors that generalize reliably to unseen subjects. All Bayesian TCN models were implemented in PyTorch \[[26](https://arxiv.org/html/2606.05373#bib.bib26)\] and trained using variational inference code adapted from the open-source repository of \[[27](https://arxiv.org/html/2606.05373#bib.bib27),[28](https://arxiv.org/html/2606.05373#bib.bib28)\].

### 3.1 Bayesian plausibility analyses of TCN architectures

We begin with a single\-patient forecasting study to examine how Bayesian evidence responds to TCN architecture choices and whether it provides a consistent criterion for navigating the architecture space\. This includes the analyses of how evidence varies with architectural complexity and its relationship to accuracy and uncertainty in predicted blood glucose trajectories\. This experiment isolates the role of evidence as a selection metric before moving to population\-level architecture discovery in the next section\.

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/bergman_data.png)

\(a\) ![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/meal_bergman.png)![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/insulin_bergman.png) \(b\) \(c\)

Figure 3: Single-patient training data used in the plausibility analysis: (a) blood glucose (CGM), (b) meal intake, and (c) insulin delivery, generated using the Bergman minimal model with Dalla Man meal absorption dynamics.

Figure [3](https://arxiv.org/html/2606.05373#S3.F3) shows the dataset generated from the Bergman minimal model with meal disturbances modeled via the Dalla Man absorption model \[[29](https://arxiv.org/html/2606.05373#bib.bib29),[30](https://arxiv.org/html/2606.05373#bib.bib30)\]. To introduce variability, meal size and timing are randomly perturbed around nominal values, while insulin is administered using a deterministic basal–bolus protocol. Detailed model equations and parameter values are provided in Appendix [B](https://arxiv.org/html/2606.05373#A2). In all TCNs, the input sequences for glucose $G$, meal $M$, and insulin $I$ share a common history window, i.e., $d_{G}=d_{M}=d_{I}$ in ([1](https://arxiv.org/html/2606.05373#S2.E1)). All subsequent results in this section were obtained using the same 90% train and 10% test split of the time series shown in Figure [3](https://arxiv.org/html/2606.05373#S3.F3). The training set corresponds to 0–10000 minutes ($\approx 7$ days) and the test set corresponds to 10000–11520 minutes ($\approx 1$ day).

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/temp_dil_rf.png)![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/temp_dil_param.png)![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/temp_dil_evid.png)\(a\)\(b\)\(c\)
![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/pred_pointA.png)![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/pred_pointB.png)![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/pred_pointC.png)\(d\) architecture A\(e\) architecture B\(f\) architecture C

Figure 4: Evidence landscape and representative predictive behavior across the TCN architecture space. (a,b) Receptive field and number of trainable parameters as a function of the number of temporal blocks $N$ and dilation base $b$. (c) Log-evidence surface over $N$ and $b$, showing a highly non-uniform structure with high-evidence architectures concentrated in an intermediate region of the architecture space. (d–f) Prediction horizons for representative architectures A, B, and C selected from distinct regions of the evidence landscape in (c). In all architectures, filter size $f=5$, stride $s=3$, encoder/decoder channels $\{c_{j}\}=\{\tilde{c}_{j}\}=\{8\}$ repeated across blocks, input history $d_{G}=d_{M}=d_{I}=1400$ min, and prediction horizon $H=500$ min.

We first examine how Bayesian evidence varies across TCN architectures with different receptive fields and model capacities. Figure [4](https://arxiv.org/html/2606.05373#S3.F4)(a and b) shows that both the number of temporal blocks $N$ and the dilation base $b$ increase the receptive field, while model size, measured by the number of trainable parameters, is primarily driven by $N$. The evidence landscape of the 136 architectures in Figure [4](https://arxiv.org/html/2606.05373#S3.F4)(c) is highly non-uniform, with high-evidence models concentrated in an intermediate region of the $(N,b)$ architecture space. This indicates that neither larger receptive fields nor larger model sizes alone improve model plausibility. Instead, evidence favors architectures that provide sufficient temporal context to capture meal and insulin dynamics while avoiding weakly informed parameters. The forecast behavior of three representative architectures (A, B, and C) from different regions of the architecture space is shown in Figure [4](https://arxiv.org/html/2606.05373#S3.F4)(d–f). Architectures A and B, located in high-evidence regions, produce accurate forecasts with tighter uncertainty bands. In contrast, architecture C, selected from a low-evidence region, yields less reliable predictions despite its larger number of temporal blocks. These results suggest that evidence is not merely measuring goodness-of-fit, but also identifying architectures with improved predictive reliability under uncertainty.

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/stride_vs_evid.png)![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/stride_3.png)![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/stride_6.png)\(a\)\(b\)s=3s=3\(c\)s=6s=6
![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/input_output.png)\(d\)

Figure 5: (a) Log-evidence as a function of the stride $s$ used to construct overlapping input–output windows for TCN training, showing that an intermediate stride provides the highest evidence. (b–c) Representative predictions for stride values $s=3$ and $s=6$. (d) Log-evidence surface over input history length $d_{G}=d_{M}=d_{I}$ and prediction horizon $H$; the coarse search over 1400 minutes localizes a high-evidence region, which is refined using a finer grid over the highlighted 300 minute region. For all panels, the TCN architecture is fixed at $N=4$ temporal blocks, filter size $f=5$, dilation base $b=3$, and encoder/decoder channels $\{c_{j}\}=\{\tilde{c}_{j}\}=\{12\}$. For (a–c), the input history and prediction horizon are fixed at $d_{G}=d_{M}=d_{I}=1400$ min and $H=1000$ min; for (d), the stride is fixed at $s=3$.

In addition to the architecture's structural parameters, evidence also provides guidance for training and forecasting choices in the TCN. Figure [5](https://arxiv.org/html/2606.05373#S3.F5) shows the effect of the stride used to construct overlapping input–output windows (a–c) and the evidence surface over input length $d$ and prediction horizon $H$ (d). The evidence peaks at stride $s=3$, indicating a balance between highly redundant training windows at small stride and reduced sample efficiency at large stride. This suggests that evidence also captures the effective information content of the training set induced by the windowing scheme. Figure [5](https://arxiv.org/html/2606.05373#S3.F5)(d) shows that evidence decreases as the prediction horizon increases, indicating that longer-horizon forecasts are less supported by the available data. For a fixed horizon, the evidence surface also localizes a favorable range of input history lengths rather than increasing monotonically with longer input windows. To identify this region efficiently, we first evaluate a coarse grid over input and output lengths and then refine the high-evidence region using a finer search. Overall, the results of this section show that evidence can be used to identify both the structural specification of the TCN architecture and the temporal configuration of the forecasting problem.
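To make the role of the stride concrete, the sketch below builds overlapping input–output windows from a single series. It is a simplified illustration: the function name is hypothetical, a single channel is used instead of the three inputs (glucose, meal, insulin), and the input is taken as exactly $d$ samples followed by $H$ output samples, which may differ by one sample from the paper's indexing.

```python
import numpy as np

def make_windows(series, d, H, s):
    """Slice a 1-D series into (input, output) pairs.

    d: input history length, H: prediction horizon, s: stride between
    consecutive window start indices (all in samples).
    """
    series = np.asarray(series)
    n_samples = series.shape[0]
    if n_samples < d + H:
        raise ValueError("series is shorter than one input-output window")
    starts = range(0, n_samples - d - H + 1, s)
    inputs = np.stack([series[k:k + d] for k in starts])
    outputs = np.stack([series[k + d:k + d + H] for k in starts])
    return inputs, outputs

# A small stride yields many, highly overlapping windows; a large stride yields few.
X1, Y1 = make_windows(np.arange(20), d=5, H=3, s=1)   # 13 windows
X3, Y3 = make_windows(np.arange(20), d=5, H=3, s=3)   # 5 windows
```

With $s=1$ consecutive windows differ by a single sample, so they carry largely redundant information, whereas a larger stride reduces the number of training pairs; this is the trade-off that the evidence curve in Figure 5(a) reflects.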

### 3.2 EVIDENT for population-level TCN architecture discovery

We now apply EVIDENT in a clinical forecasting setting involving inter-patient variability and CGM sensor noise. The objective is to leverage retrospective population data to identify TCN architectures that generalize reliably to unseen individual T1D patients when predicting blood glucose (BG) trajectories.

#### 3.2.1 Dataset, architecture space, and validation protocol

##### Population data and patient\-wise folds

Population\-level type 1 diabetes \(T1D\) trajectories are generated using the UVA/Padova T1D simulator\[[31](https://arxiv.org/html/2606.05373#bib.bib31)\]\. The dataset consists of 10 adult in silico subjects, each providing continuous glucose monitoring \(CGM\) data together with the corresponding meal and insulin inputs, as shown in Figure[6](https://arxiv.org/html/2606.05373#S3.F6)\. To evaluate generalization under inter\-patient variability, we use a 5\-fold cross\-validation protocol: in each fold, 8 patients are used for population training and the remaining 2 subjects are held out for patient\-specific validation, as summarized in Table[1](https://arxiv.org/html/2606.05373#S3.T1)\.

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/fda_data.png)

Figure 6: Blood-glucose trajectories for 10 adult in silico subjects generated using the UVA/Padova T1D simulator \[[31](https://arxiv.org/html/2606.05373#bib.bib31)\].

Table 1: Patient-wise 5-fold cross-validation splits for the 10 adult in silico patients.

##### Candidate architecture pool

The candidate TCN pool is constructed from the feasible architecture space in ([3](https://arxiv.org/html/2606.05373#S2.E3)). To ensure adequate temporal context for glucose dynamics, architectures are constrained to satisfy $1~\text{day}\leq\mathrm{RF}(\xi)\leq 2000~\text{min}$, where the lower bound ensures access to at least one daily cycle and the upper bound avoids unnecessarily long temporal context. We also enforce the no-holes condition $f\geq b$ for dilated convolutions. Additional implementation bounds on filter size, depth, and channel width are imposed to keep the search computationally tractable. These constraints yield a finite pool of 61 candidate TCN architectures. For EVIDENT, the candidates are partitioned into six ordered capacity levels according to the number of trainable parameters, as summarized in Table 2. Full architecture specifications are provided in Appendix [C](https://arxiv.org/html/2606.05373#A3).
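The feasibility constraints can be checked with a few lines of code. The receptive-field expression used below, $\mathrm{RF}=1+2(f-1)\sum_{j=1}^{N}b^{j-1}$, is an assumption: the paper's own definition is given in Section 2 and is not reproduced here. It corresponds to two dilated convolutions per temporal block and is adopted because it reproduces the two receptive fields reported in Section 3.3 ($\mathrm{RF}=1937$ for $N=5$, $f=9$, $b=3$ and $\mathrm{RF}=1873$ for $N=4$, $f=7$, $b=5$). The sketch also assumes one sample per minute, so that one day equals 1440 samples.

```python
def receptive_field(n_blocks, filter_size, dilation_base, convs_per_block=2):
    """Receptive field (in samples) of stacked dilated convolutions.

    Block j uses dilation dilation_base**(j-1); convs_per_block=2 is an
    assumption consistent with the RF values quoted in the text.
    """
    dilation_sum = sum(dilation_base ** j for j in range(n_blocks))
    return 1 + convs_per_block * (filter_size - 1) * dilation_sum

def is_feasible(n_blocks, filter_size, dilation_base, rf_min=1440, rf_max=2000):
    """Apply the no-holes condition f >= b and the receptive-field bounds."""
    rf = receptive_field(n_blocks, filter_size, dilation_base)
    return filter_size >= dilation_base and rf_min <= rf <= rf_max

rf_level2_arch4 = receptive_field(5, 9, 3)   # architecture selected by EVIDENT
rf_rand_arch8 = receptive_field(4, 7, 5)     # architecture selected by random search
```

Enumerating $(N, f, b)$ over the implementation bounds and filtering with `is_feasible` would give a finite pool; the exact bounds that lead to the 61 candidates are those listed in the appendix, not the defaults shown here.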

##### Validation protocol

The training and validation protocol is designed to emulate deployment to unseen T1D subjects with limited subject-specific data. In each cross-validation fold, candidate architectures are first trained on population data using input–output windows extracted from inter-meal intervals. The output horizon is set to 285 minutes, corresponding to the minimum inter-meal gap, and windows containing meal events within the prediction horizon are excluded. For each held-out subject, the posterior distribution obtained from population training is subsequently used as the prior for subject-specific adaptation using four days of that subject’s data. Validation is then performed using rolling one-hour prediction windows shifted to cover the whole of day 5, while excluding windows that intersect meal events and truncating predictions 10 minutes before meal onset to avoid scoring across unobserved disturbances. A candidate architecture is considered acceptable if its NLPD remains below the prescribed tolerance $Tol=14.5$ across all one-hour validation windows for both held-out subjects.

Table 2: Candidate TCN architectures organized into six capacity levels based on the number of trainable parameters. Each row reports the range of architecture hyperparameters used in level $l$: the number of temporal blocks $N$, filter size $f$, dilation base $b$, and starting number of encoder channels $c_{1}$. The dilation factor in encoder block $j$ is defined as $\delta_{j}=b^{j-1}$, and the number of encoder channels follows a doubling schedule $c_{j}=2^{j-1}c_{1}$ for $j=1,\dots,N$. The decoder channels $\{\tilde{c}_{j}\}_{j=1}^{N}$ mirror the encoder channels in reverse order.

| Capacity level $l$ | Parameter range | Number of architectures | Number of temporal blocks $N$ | Filter size $f$ | Dilation base $b$ | Starting number of channels $c_{1}$ |
|---|---|---|---|---|---|---|
| 1 | $\leq 30$k | 10 | [3-4] | [7-13] | [4-9] | [2,4] |
| 2 | 30k–100k | 9 | [3-6] | [3-13] | [3-9] | [2,8] |
| 3 | 100k–400k | 10 | [3-6] | [3-13] | [2-9] | [2,16] |
| 4 | 400k–1M | 11 | [3-7] | [3-13] | [2-9] | [2,32] |
| 5 | 1M–5M | 11 | [3-7] | [3-13] | [2-9] | [4,64] |
| 6 | $\geq 5$M | 10 | [3-7] | [7-13] | [2-9] | [8,128] |
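The schedules in the caption of Table 2 and the assignment of an architecture to a capacity level can be written out directly. The helper names below are illustrative, and the treatment of parameter counts that fall exactly on a level boundary (assigned to the lower level here) is an assumption, since the table does not state which side is inclusive.

```python
from bisect import bisect_left

# Upper parameter-count bounds of levels 1 to 5, taken from Table 2.
LEVEL_UPPER_BOUNDS = [30_000, 100_000, 400_000, 1_000_000, 5_000_000]

def dilations(n_blocks, dilation_base):
    """Dilation factor per encoder block: delta_j = b**(j-1)."""
    return [dilation_base ** (j - 1) for j in range(1, n_blocks + 1)]

def encoder_channels(n_blocks, c1):
    """Doubling schedule: c_j = 2**(j-1) * c1."""
    return [2 ** (j - 1) * c1 for j in range(1, n_blocks + 1)]

def decoder_channels(n_blocks, c1):
    """Decoder channels mirror the encoder channels in reverse order."""
    return encoder_channels(n_blocks, c1)[::-1]

def capacity_level(n_params):
    """Map a trainable-parameter count to a capacity level in 1..6."""
    return bisect_left(LEVEL_UPPER_BOUNDS, n_params) + 1

channels = encoder_channels(5, 2)    # schedule for N=5 blocks starting at c1=2
level = capacity_level(43268)        # parameter count quoted for the selected model
```

For example, $N=5$ and $c_{1}=2$ give encoder channels $[2,4,8,16,32]$, and a model with 43268 parameters falls in Level 2.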

#### 3.2.2 Level-wise architecture discovery results

##### Capacity Level 1 ($l=1$)

Level 1 contains the lowest-capacity TCN candidates, with fewer than $3\times 10^{4}$ parameters, that still satisfy the receptive-field feasibility constraints. As shown in Figure [7](https://arxiv.org/html/2606.05373#S3.F7), model evidence is not uniformly distributed within this level; it concentrates on a small subset of architectures in each held-out validation fold. Despite this, none of the Level 1 architectures satisfies the validation criterion. As illustrated in Figure [7](https://arxiv.org/html/2606.05373#S3.F7), predictions on held-out subjects frequently violate the NLPD tolerance across one-hour windows (shown in red). These failures indicate that, although some architectures are favored by the population-based training data, their representational capacity is insufficient to generalize to subject-specific glucose dynamics. Consequently, EVIDENT rejects Level 1 and proceeds to the next capacity level.

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat1_fold1.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat1_fold2.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat1_fold3.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat1_fold4.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat1_fold5.png)

\(a\) ![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat1_p2_c1.png)![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat1_p3_c1.png) \(b\) \(c\)

Figure 7: Level 1 architecture ranking and validation outcome. (a) Posterior plausibility $\rho$ versus model size (number of parameters) for Level 1 architectures across validation folds (left to right: folds 1–5). (b,c) Representative validation failures for the most plausible architectures on held-out subjects: (b) Patient 2 from fold 3, architecture $Level_{1}Arch_{4}$; (c) Patient 3 from fold 4, architecture $Level_{1}Arch_{6}$. Red segments indicate one-hour prediction windows that violate the acceptance criterion $\mathrm{NLPD}\leq Tol$.
##### Capacity Level 2 ($l=2$)

Level 2 contains the intermediate-capacity regime ($3\times 10^{4}$ to $1\times 10^{5}$ parameters), in which EVIDENT identifies architectures that are both plausible and validation-acceptable. Figure [8](https://arxiv.org/html/2606.05373#S3.F8)(a) summarizes, for each held-out validation fold, the variation of posterior plausibility $\rho$ and the corresponding MAP estimate of $\sigma_{\text{noise}}$ across the candidate architectures in this level. A small subset of architectures receives substantial posterior model plausibility, while the most plausible architecture varies across folds, reflecting the inter-patient heterogeneity of the population-based training data in Figure [6](https://arxiv.org/html/2606.05373#S3.F6). While most subjects exhibit comparable glucose ranges and excursion patterns, Patient 7 and Patient 5 display substantially larger amplitudes, which changes the temporal structure seen during training. As a result, EVIDENT favors different architectures across folds. Nevertheless, $Level_{2}Arch_{4}$, which has 43268 parameters and consists of 5 temporal blocks with dilation base $b=3$, filter size $f=9$, and encoder channels $[2,4,8,16,32]$, emerges as a competitive architecture in three validation folds. This architecture achieves a one-day receptive field through gradual dilation growth across depth, while retaining sufficient local temporal resolution to capture meal-driven glucose excursions. The trends in $\sigma_{\mathrm{noise},\mathrm{MAP}}$ provide a diagnostic: low-plausibility architectures require inflated noise levels, indicating that unresolved temporal structure is absorbed as noise, whereas competitive models achieve high plausibility without such inflation.
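For readers who want to reproduce the ranking step, the following sketch converts log-evidence values into posterior plausibilities within one level. It assumes the standard Bayesian form in which $\rho(\xi)$ is proportional to evidence times prior, normalized over the architectures in the level, with a uniform prior unless one is supplied; the paper's own definition of $\rho$ appears in Section 2 and may differ in detail. The numerical values are made up.

```python
import numpy as np

def posterior_plausibility(log_evidence, log_prior=None):
    """Normalize log-evidence values into plausibilities that sum to one."""
    log_evidence = np.asarray(log_evidence, dtype=float)
    if log_prior is None:
        log_prior = np.zeros_like(log_evidence)      # uniform prior (assumption)
    logits = log_evidence + np.asarray(log_prior, dtype=float)
    logits = logits - logits.max()                   # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Illustrative log-evidence values for three candidate architectures.
rho = posterior_plausibility([-1000.0, -1002.0, -1010.0])
```

Because evidence enters through an exponential, modest differences in log-evidence already concentrate the plausibility on a few architectures, which matches the concentration pattern described for Figure 8(a).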

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_fold1.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_fold2.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_fold3.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_fold4.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_fold5.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_fold1_noise.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_fold2_noise.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_fold3_noise.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_fold4_noise.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_fold5_noise.png)

\(a\) ![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_p9_c2.png)![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_p10_c2.png)

\(b\) \(c\) ![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_p1_c2.png)![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_p2_c2.png)

\(d\) \(e\)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_p3_c2.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_p4_c2.png)

\(f\) \(g\)

Figure 8: Level 2 architecture ranking and validation outcomes. (a) Posterior plausibility $\rho$ and MAP estimate of $\sigma_{\text{noise}}$ as functions of model size (number of trainable parameters) for Level 2 architectures across validation folds (left to right: folds 1–5). (b,c) Representative validation failures for the most plausible architecture ($Level_{2}Arch_{2}$) on held-out patients 9 and 10 from Fold 1. Red segments indicate one-hour prediction windows that violate the acceptance criterion $\mathrm{NLPD}\leq Tol$. (d–g) Validation results for the EVIDENT-selected architecture ($Level_{2}Arch_{4}$) across multiple folds and held-out patients: (d,e) patients 1–2 and (f,g) patients 3–4. The one-hour forecasts of $Level_{2}Arch_{4}$ pass all validation tests, demonstrating that EVIDENT identified a trustworthy predictor.

The top-ranked architectures in each fold are subsequently evaluated on the held-out subjects using the validation protocol. Although Level 2 architectures exhibit improved predictive performance relative to Level 1, most candidates still fail the validation criterion by violating the NLPD tolerance in a subset of one-hour prediction windows (representative failure cases are shown in Figure [8](https://arxiv.org/html/2606.05373#S3.F8)(b,c)). EVIDENT, however, identifies a single architecture, $Level_{2}Arch_{4}$, that satisfies the validation criterion across all folds and held-out subjects. As shown in Figure [8](https://arxiv.org/html/2606.05373#S3.F8)(d–g), this model produces consistent one-hour forecasts that remain within the prescribed tolerance across all validation windows. This architecture is therefore selected as the trustworthy predictor.

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_p6_c2.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_p7_c2.png)

\(a\) \(b\) ![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_p6_bma.png)![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat2_p7_bma.png)

\(c\) \(d\)

Figure 9: Comparison of single-architecture and plausibility-weighted ensemble predictions for representative Level 2 candidates in fold 2. (a–b) Predictions from a single architecture ($Level_{2}Arch_{7}$) for Patients 6–7 in fold 2, and (c–d) corresponding ensemble predictions from all architectures in Level 2. Red segments indicate one-hour prediction windows that violate the acceptance criterion $\mathrm{NLPD}\leq Tol$.

In addition to single-architecture selection, EVIDENT also supports a plausibility-weighted ensemble predictor when multiple architectures remain competitive. To assess this, we compare the predictions of the most plausible architecture with the ensemble prediction defined in ([12](https://arxiv.org/html/2606.05373#S2.E12)) over all Level 2 architectures for fold 2 (Figure [9](https://arxiv.org/html/2606.05373#S3.F9)). The ensemble achieves improved predictive accuracy and lower NLPD relative to individual architectures, while producing wider uncertainty bands due to aggregation across models. This behavior reflects an ensemble effect, where different architectures capture complementary temporal patterns and their combination mitigates individual model deficiencies. These results indicate that, although a single architecture can satisfy the validation criterion, the plausibility-weighted ensemble provides a mechanism for improving predictive performance while capturing model uncertainty in the presence of multiple candidates.
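The widening of the uncertainty bands under aggregation can be illustrated with a small sketch. It assumes the ensemble is a plausibility-weighted mixture of the per-architecture predictive distributions (standard Bayesian model averaging); Eq. (12) is defined in Section 2 and is not reproduced here, so this form is an assumption. The function name and the numbers are illustrative.

```python
import numpy as np

def ensemble_predict(means, variances, rho):
    """Mean and variance of a plausibility-weighted mixture.

    means, variances: arrays of shape (n_models, H) with each model's
    predictive mean and variance per forecast step.
    rho: plausibility weights of length n_models (renormalized here).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    rho = np.asarray(rho, dtype=float)
    rho = rho / rho.sum()
    mix_mean = rho @ means
    # Law of total variance: within-model variance plus between-model spread.
    second_moment = rho @ (variances + means ** 2)
    mix_var = second_moment - mix_mean ** 2
    return mix_mean, mix_var

# Two models that disagree by 20 mg/dL, each with predictive variance 4.
mean, var = ensemble_predict([[100.0], [120.0]], [[4.0], [4.0]], [0.5, 0.5])
```

In this toy case the mixture variance is 104, that is, the within-model variance of 4 plus a between-model term of 100, which shows how disagreement among competitive architectures widens the ensemble's uncertainty band.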

##### Capacity Level 3 ($l=3$)

Level 3 contains higher-capacity architectures ($10^{5}$ to $4\times 10^{5}$ parameters). Despite their increased representational capacity, these models do not improve validation performance and frequently violate the NLPD criterion across multiple prediction windows (Figure [10](https://arxiv.org/html/2606.05373#S3.F10)). The additional parameters at this level are not sufficiently constrained by the available data, leading to degraded generalization under inter-patient variability. In contrast to Level 2, where intermediate-capacity architectures satisfy the validation criterion, increasing capacity beyond this regime yields no consistent improvement and introduces prediction errors.

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat3_p1_c3.png)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat3_p4_c3.png)

\(a\) \(b\)

Figure 10: Representative validation failures for Level 3 architectures on held-out patients. (a) Patient 1 from fold 3 using $Level_{3}Arch_{10}$ and (b) Patient 4 from fold 4 using $Level_{3}Arch_{4}$. Red segments indicate one-hour prediction windows that violate the acceptance criterion $\mathrm{NLPD}\leq Tol$.
##### Clinical risk mapping

To complement NLPD-based validation, we evaluate representative rejected and accepted architectures using the Parkes error grid \[[32](https://arxiv.org/html/2606.05373#bib.bib32),[33](https://arxiv.org/html/2606.05373#bib.bib33)\], a clinically established mapping for assessing the impact of glucose prediction errors on treatment decisions. The grid partitions errors into risk zones: no clinical impact (Zone A), minimal clinical consequence (Zone B), incorrect action affecting clinical outcome (Zone C), significant medical risk (Zone D), and dangerous treatment decisions (Zone E). Figure [13](https://arxiv.org/html/2606.05373#S3.F13) compares a rejected Level 1 architecture with the selected Level 2 model. To account for uncertainty, we overlay uncertainty bands from both CGM measurement noise and TCN predictions. While the mean predictions of the Level 1 model appear reasonable, its uncertainty spreads into higher-risk regions (primarily Zone B), indicating potential for clinically suboptimal decisions. In contrast, the selected Level 2 architecture concentrates predictions and the associated uncertainty within Zone A, indicating clinically safe behavior. This comparison indicates that EVIDENT’s validation-under-uncertainty step is also aligned with clinically meaningful risk criteria.

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/error_grid_bad.png)

(a)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/error_grid_good.png)

(b)

Figure 13: Clinical risk assessment using the Parkes error grid for representative architectures rejected and accepted by EVIDENT. (a) Rejected Level 1 architecture ($Level_{1}Arch_{1}$) shows uncertainty extending into higher-risk Zone B. (b) Selected Level 2 architecture ($Level_{2}Arch_{4}$) concentrates predictions within Zone A, corresponding to clinically safe decisions.

### 3.3 Random-search baseline

To provide a baseline for architecture discovery, we compare EVIDENT against random search with a similar computational budget. Random search is a standard baseline for hyperparameter and architecture selection because it provides a non-adaptive sampling strategy and has been shown to be competitive with grid search under fixed computational budgets \[[34](https://arxiv.org/html/2606.05373#bib.bib34)\]. In contrast to EVIDENT, which progressively narrows the candidate space through evidence-guided ranking and validation across ordered capacity levels, random search samples architectures uniformly from the feasible pool and selects models solely based on held-out performance.

The baseline study uses the same pool of 61 feasible TCN architectures and the same population-to-subject-specific prediction protocol employed in the EVIDENT experiments. To approximately match the computational budget required by EVIDENT up to acceptance, random search samples 19 architectures uniformly without replacement from the feasible candidate pool, corresponding to the total number of architectures evaluated in Levels 1 and 2. The sampled architectures are listed in Appendix [C](https://arxiv.org/html/2606.05373#A3). To distinguish incomplete architecture coverage from the selection rule itself, we consider a random-search subset that includes the EVIDENT-selected architecture, denoted here as $RandArch_{6}=Level_{2}Arch_{4}$. Each sampled architecture is first trained on the population training subjects within the corresponding fold. For each held-out subject, the resulting population posterior is then used as the prior for subject-specific adaptation using three days of patient data. Architecture selection is subsequently performed on day 4 using rolling one-hour prediction windows, where the selected architecture minimizes the NLPD across the held-out subjects. Consistent with the EVIDENT protocol, day 5 is excluded from model selection and reserved exclusively for the final comparison with the EVIDENT results.
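The baseline can be summarized in a short sketch: draw a fixed-size subset uniformly without replacement, then pick the architecture with the lowest average validation NLPD. The function names are hypothetical, and the use of rejection sampling to obtain a subset that contains a given architecture is an assumption; the text does not state how the subset containing the EVIDENT-selected architecture was obtained.

```python
import random

def sample_subset(pool, budget, must_include=None, seed=0, max_tries=10_000):
    """Uniformly sample `budget` distinct architectures from `pool`.

    If must_include is given, redraw until the subset contains it
    (rejection sampling; an assumption made for this illustration).
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        subset = rng.sample(pool, budget)
        if must_include is None or must_include in subset:
            return subset
    raise RuntimeError("no subset containing the required architecture was drawn")

def select_by_validation(nlpd_by_arch):
    """Pick the architecture with the lowest mean NLPD over held-out subjects."""
    return min(nlpd_by_arch, key=lambda a: sum(nlpd_by_arch[a]) / len(nlpd_by_arch[a]))

pool = list(range(61))                                  # indices of the 61 candidates
subset = sample_subset(pool, 19, must_include=14, seed=1)
best = select_by_validation({"a": [14.0, 15.0], "b": [13.0, 13.5]})
```

Unlike the level-wise procedure, this selection rule uses only held-out NLPD and ignores both model evidence and the capacity ordering of the candidates.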

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat3_d5f7_random_p3.png)

(a)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/cat3_d5f7_random_p4.png)

(b)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/RMSE_rand.png)

(c)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/CV_rand.png)

(d)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/parkes_p3_rand.png)

(e)

![Refer to caption](https://arxiv.org/html/2606.05373v1/Figures_manuscript/parkes_rand_p4.png)

(f)

Figure 20: Performance comparison between the random-search-selected architecture $RandArch_{8}$ and the EVIDENT-selected architecture $Level_{2}Arch_{4}$ for the held-out patients in Fold 4. (a,b) show the 1-hour rolling glucose predictions of $RandArch_{8}$ for patients 3 and 4, respectively. (c) compares the corresponding rolling-window RMSE, while (d) compares the predictive coefficient of variation (%CV). (e,f) show the Parkes error grid analysis for $RandArch_{8}$.

The random-search criterion selects $RandArch_{8}$ because it attains the lowest average day 4 validation NLPD for the held-out patients, even though $RandArch_{6}=Level_{2}Arch_{4}$ is among the 19 randomly selected architectures listed in Appendix [C](https://arxiv.org/html/2606.05373#A3). $RandArch_{8}$ contains 133424 trainable parameters, with $N=4$, $f=7$, $b=5$, encoder channels $[8,16,32,64]$, and $\mathrm{RF}=1873$. In comparison, $Level_{2}Arch_{4}$ contains 43268 trainable parameters and has a similar receptive field, $\mathrm{RF}=1937$; its full architectural specification is reported in Appendix [C](https://arxiv.org/html/2606.05373#A3). Thus, the two models have comparable temporal coverage, but $RandArch_{8}$ achieves it with a larger dilation base and more channels (its filter, $f=7$, is in fact narrower than the $f=9$ of $Level_{2}Arch_{4}$), leading to approximately three times as many trainable parameters.

Figure [20](https://arxiv.org/html/2606.05373#S3.F20) compares the 1-hour rolling predictions of $RandArch_{8}$ for the held-out patients 3 and 4 in Fold 4. For the diagnostic comparison in this figure, we also report the rolling-window root-mean-squared error (RMSE) and coefficient of variation (CV), which represent the accuracy and reliability of the TCN forecast, respectively,

$$
\mathrm{RMSE}=\left[\frac{1}{H}\left\|G_{\mathrm{data}}^{(k+1:k+H)}-\overline{\mathcal{T}}_{w,\xi}\!\left(G_{\mathrm{data}}^{(k-d_{G}:k)},M_{\mathrm{data}}^{(k-d_{M}:k)},I_{\mathrm{data}}^{(k-d_{I}:k)}\right)\right\|_{2}^{2}\right]^{1/2},
\tag{15}
$$

and

$$
\mathrm{CV}=\frac{100}{H}\sum_{h=1}^{H}\frac{\sqrt{\mathrm{Var}\!\left[\mathcal{T}_{w,\xi}^{(k+h)}\!\left(G_{\mathrm{data}}^{(k-d_{G}:k)},M_{\mathrm{data}}^{(k-d_{M}:k)},I_{\mathrm{data}}^{(k-d_{I}:k)}\right)\right]}}{\left|\overline{\mathcal{T}}_{w,\xi}^{(k+h)}\!\left(G_{\mathrm{data}}^{(k-d_{G}:k)},M_{\mathrm{data}}^{(k-d_{M}:k)},I_{\mathrm{data}}^{(k-d_{I}:k)}\right)\right|}.
\tag{16}
$$

Here, $h$ indexes the forecast step within the $H$-step prediction horizon, $\overline{\mathcal{T}}_{w,\xi}$ is the posterior predictive mean, and $\mathrm{Var}[\mathcal{T}_{w,\xi}^{(k+h)}]$ is the posterior predictive variance at forecast step $k+h$. The rolling-window RMSE and CV show that the additional capacity of $RandArch_{8}$ does not translate into more reliable validation performance. For patient 3, $RandArch_{8}$ exhibits larger RMSE over several one-hour windows and higher predictive CV, resulting in failure under the NLPD validation criterion. For patient 4, $RandArch_{8}$ attains lower RMSE in some windows, but this subject-specific improvement is not consistent across the held-out set. In contrast, $Level_{2}Arch_{4}$ satisfies the NLPD criterion for both held-out patients while using a substantially smaller architecture. These results demonstrate that EVIDENT improves the reliability of architecture discovery compared to random search, not because random sampling may fail to include the appropriate architecture, but because architecture selection based solely on held-out validation performance is not sufficiently robust for future-day probabilistic forecasting.
By coupling a search guided by Bayesian model evidence with held-out validation, EVIDENT avoids selecting larger architectures whose forecasts do not generalize consistently to unseen subjects.
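
The two diagnostics in (15) and (16) can be sketched for a single forecast window as follows. This is a minimal illustration, not the authors' implementation: it assumes the posterior predictive distribution is represented by a finite set of sampled forecasts, and the function name `rolling_rmse_cv` and the use of the population variance are assumptions made here.

```python
import math
from statistics import fmean, pstdev


def rolling_rmse_cv(samples, target):
    """Return (RMSE, %CV) for one H-step forecast window.

    samples: list of S posterior predictive draws, each a list of H glucose values
             (assumed representation of the predictive distribution).
    target:  list of H observed glucose values for the same window.
    """
    horizon = len(target)
    # Collect the S draws at each forecast step h = 1..H.
    per_step = [[draw[h] for draw in samples] for h in range(horizon)]
    mean = [fmean(vals) for vals in per_step]   # posterior predictive mean
    std = [pstdev(vals) for vals in per_step]   # sqrt of predictive variance (population form)
    # Eq. (15): root of the mean squared error between data and predictive mean.
    rmse = math.sqrt(fmean([(g - m) ** 2 for g, m in zip(target, mean)]))
    # Eq. (16): predictive std relative to |mean|, averaged over the horizon, in percent.
    cv = 100.0 * fmean([s / abs(m) for s, m in zip(std, mean)])
    return rmse, cv


# Toy window with H = 2 steps and S = 2 draws (made-up numbers, in mg/dL).
rmse, cv = rolling_rmse_cv([[100.0, 200.0], [120.0, 200.0]], [113.0, 204.0])
```

Sliding this computation over consecutive one-hour windows would give rolling curves of the kind compared in panels (c) and (d) of Figure 20.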

## 4 Discussion

The primary methodological insight of this work is that Bayesian model evidence provides a natural criterion for architecture and prior selection, particularly in data-limited and heterogeneous settings. In practice, the required Bayesian quantities are approximated: the posterior over network weights is obtained through variational inference, the evidence is evaluated using a Laplace approximation, and the search is restricted to a finite, pre-defined pool of feasible TCN architectures. Across this candidate pool, the approximate evidence landscape is highly non-uniform and concentrates on a narrow intermediate-capacity regime, rather than favoring architectures with the largest receptive field or parameter count. This behavior suggests that strong predictive performance in this setting arises from a balance between temporal expressivity and the degree to which model parameters are learned from the available data. At the same time, the results demonstrate that evidence alone is insufficient for accepting a model as an accurate and reliable predictor under uncertainty. Because the evidence is evaluated on the training data used for model construction, it can still favor architectures that explain that data well without satisfying the desired generalization behavior on unseen subjects. This observation motivates a clear separation between model ranking and model validation with quantified uncertainty. Within EVIDENT, approximate Bayesian evidence is used to rank candidate architectures within a prescribed search space, while task-specific validation under predictive uncertainty determines whether a candidate satisfies the requirements of the intended forecasting task. Consequently, EVIDENT identifies architectures that remain credible when validated under uncertainty, rather than selecting models solely based on training or validation error.
Comparison against a random-search baseline further highlights the importance of separating evidence-guided ranking from validation. In particular, selection based only on held-out validation performance can favor larger architectures that achieve competitive short-term accuracy but exhibit reduced robustness and less reliable generalization to unseen subjects.
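
The separation between ranking and validation described above can be sketched as a short control loop. This is one plausible reading of the procedure, written under assumptions: the function `select_architecture`, the callables `log_evidence` and `passes_validation`, and the rule of trying candidates in descending evidence order within each level are illustrative choices, not the authors' code.

```python
def select_architecture(levels, log_evidence, passes_validation):
    """Return the first validated candidate, scanning levels from low to high capacity.

    levels:            list of lists of architecture IDs, lowest-capacity level first.
    log_evidence:      hypothetical callable, ID -> approximate log evidence (ranking only).
    passes_validation: hypothetical callable, ID -> bool (task-specific acceptance test).
    """
    for candidates in levels:
        # Rank within the level by approximate evidence, most plausible first.
        ranked = sorted(candidates, key=log_evidence, reverse=True)
        for arch in ranked:
            # Evidence only orders the candidates; acceptance is decided by validation.
            if passes_validation(arch):
                return arch
    return None  # no candidate in the pool satisfied the validation criterion


# Toy pool with made-up evidence values and validation outcomes.
toy_evidence = {"L1A1": -10.0, "L1A2": -5.0, "L2A1": -3.0, "L2A2": -1.0}
toy_valid = {"L2A1", "L2A2"}
chosen = select_architecture(
    [["L1A1", "L1A2"], ["L2A1", "L2A2"]],
    toy_evidence.get,
    lambda arch: arch in toy_valid,
)
```

In this toy pool both Level 1 candidates fail validation, so the loop moves to Level 2 and returns the candidate with the larger evidence there.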

Beyond the glucose forecasting application, EVIDENT can be applied to neural forecasting problems characterized by noisy measurements, limited data, and heterogeneous operating conditions. In such settings, architecture selection is better viewed as a model selection problem under task-specific constraints than as pure hyperparameter optimization. From this perspective, EVIDENT contributes two key elements. First, the level-wise search imposes structure on the otherwise large combinatorial architecture space by progressively exploring increasingly expressive model classes, reducing the need to evaluate all candidates at once. Second, the framework couples evidence-based ranking with task-specific validation, so that model acceptance depends on predictive accuracy and reliability for the target task rather than on training fit alone. The plausibility-weighted ensemble results further suggest that when approximate evidence is distributed across multiple competitive architectures, EVIDENT can be extended from selecting a single model to combining a small set of candidates, improving predictive robustness while better representing prediction uncertainty. Overall, this work positions architecture discovery as a principled selection process that balances model capacity against task-dependent evaluation. We emphasize that EVIDENT is explicitly task-specific: the discovered architecture is validated for a particular forecasting context, not as a universally optimal model.
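
A plausibility-weighted ensemble of the kind mentioned above can be illustrated as follows. The sketch assumes equal prior plausibility for the competing architectures, so that the weights are the normalized evidences, and it combines per-model predictive means and variances as a mixture; the exact weighting used in the paper is defined in an earlier section, and the function names here are hypothetical.

```python
import math


def plausibility_weights(log_evidences):
    """Normalize evidences into weights, assuming a uniform prior over architectures."""
    shift = max(log_evidences)  # subtract the max before exponentiating to avoid underflow
    unnormalized = [math.exp(v - shift) for v in log_evidences]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def ensemble_moments(weights, means, variances):
    """Mean and variance of the weighted mixture of per-model predictive distributions."""
    mean = sum(w * mu for w, mu in zip(weights, means))
    second_moment = sum(w * (var + mu ** 2) for w, mu, var in zip(weights, means, variances))
    # Mixture variance = weighted within-model variance + between-model spread.
    return mean, second_moment - mean ** 2


# Two competing models with made-up log evidences and one-step forecasts (mg/dL).
weights = plausibility_weights([-1000.0, -1000.0 + math.log(3.0)])
mix_mean, mix_var = ensemble_moments(weights, [100.0, 120.0], [4.0, 4.0])
```

The mixture variance exceeds each model's own variance whenever the models disagree, which is one way an ensemble can represent prediction uncertainty more fully than a single selected model.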

Several limitations of the present study call for extension and improvement in future work. First, the current implementation evaluates architectures over a finite, discrete candidate pool defined by receptive-field and structural constraints tailored to glucose dynamics. While this restriction provides an interpretable architecture pool, it is not intrinsic to the EVIDENT framework. The same evidence-guided ranking and validation strategy can be applied to continuous or hybrid architecture spaces by integrating with standard architecture search methods (e.g., [[7](https://arxiv.org/html/2606.05373#bib.bib7),[4](https://arxiv.org/html/2606.05373#bib.bib4)]) to propose candidates within and across capacity levels. Second, the Bayesian formulation employed in this work relies on several approximations. The posterior over network weights is approximated using mean-field variational inference, while the model evidence is estimated through a Laplace approximation based on the computed variational posterior. Although these approximations provide a computationally tractable alternative to exact Bayesian inference, they can influence the resulting plausibility landscape and, consequently, the ranking of candidate architectures. Future work will investigate more expressive posterior representations, including Markov chain Monte Carlo methods and richer variational families, e.g., [[35](https://arxiv.org/html/2606.05373#bib.bib35)], as well as alternative evidence estimators that may improve the robustness and stability of architecture ranking. Finally, the type 1 diabetes study is based on a relatively small cohort of patients, which is appropriate for methodology development but does not substitute for evaluation on real clinical datasets with more complex sources of variability, CGM sensor noise, and meal recording.

## 5 Conclusions

This paper introduced EVIDENT, a framework for neural architecture discovery that systematically integrates Bayesian learning, evidence-based ranking, and task-specific validation. The framework identifies the lowest-capacity, i.e., most parsimonious, architecture that is both supported by the available data and satisfies the prescribed validation criterion, and thus delivers reliable forecasts from uncertain time-series data. Applied to Bayesian TCNs for subject-specific blood glucose forecasting, the results show that validation-accepted predictors lie in an intermediate-capacity regime, while both under- and over-parameterized models fail to generalize to unseen patients. These findings highlight the importance of combining evidence-guided ranking with task-specific validation for architecture design in consequential forecasting settings with limited and uncertain data.

## Acknowledgments

Md Azharul Islam, Tarunraj Singh, and Danial Faghihi acknowledge support from the National Science Foundation \(NSF\) under award number DMS\-2533946\. The authors also acknowledge computational support provided by the Center for Computational Research at the University at Buffalo\.

## Funding

This work was supported by the NSF under award number DMS\-2533946\. NSF had no role in the analysis, interpretation of results, writing of the manuscript, or decision to submit the article for publication\.

## Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper\.

## Data and code availability

The data used in this study were generated from the Bergman minimal model and the UVA/Padova Type 1 Diabetes simulator, as described in the manuscript\. The generated datasets and implementation code will be made available in a public GitHub repository upon publication\. During peer review, they are available from the corresponding author upon request\.

## Declaration of generative AI in the manuscript preparation process

During the preparation of this work, the authors used Grammarly and ChatGPT to assist with language editing, clarity, and readability. After using these tools, the authors reviewed and edited the manuscript as needed and take full responsibility for the content of the published article.

## Appendix A Evidence approximation details

This appendix summarizes the Laplace approximations used to evaluate the evidence terms introduced in Section [2.3](https://arxiv.org/html/2606.05373#S2.SS3). For a fixed architecture $\xi$ and inference hyperparameters $\sigma=(\sigma_{\mathrm{pr}},\sigma_{\mathrm{noise}})$, the evidence distribution in ([7](https://arxiv.org/html/2606.05373#S2.E7)) is approximated by

$$
\begin{aligned}
\log\pi_{\mathrm{evid}}(\mathcal{D}\mid\sigma,\xi)\approx\;
&-\frac{1}{2\sigma_{\mathrm{noise}}^{2}}\sum_{k=1}^{N_{D}}\left\|G_{\mathrm{data}}^{(k+1:k+H)}-\mathcal{T}_{w_{\mathrm{MAP}},\xi}\!\big(G_{\mathrm{data}}^{(k-d_{G}:k)},M_{\mathrm{data}}^{(k-d_{M}:k)},I_{\mathrm{data}}^{(k-d_{I}:k)}\big)\right\|_{2}^{2}\\
&-\frac{N_{D}H}{2}\log\!\big(2\pi\sigma_{\mathrm{noise}}^{2}\big)-\frac{1}{2\sigma_{\mathrm{pr}}^{2}}\|w_{\mathrm{MAP}}\|_{2}^{2}-\frac{W}{2}\log\!\big(2\pi\sigma_{\mathrm{pr}}^{2}\big)\\
&-\frac{1}{2}\log\det\!\left(\frac{P}{2\pi}\right).
\end{aligned}
\tag{17}
$$

Here, $N_{D}$ denotes the number of training input–output windows and $H$ the prediction horizon, so that $N_{D}H$ is the total number of blood glucose observations. The data-misfit term is evaluated at $w_{\mathrm{MAP}}$, approximated by the mean of the variational posterior. The matrix $P$ denotes the Hessian of the negative log-posterior with respect to the trainable weights. Rather than computing $P$ explicitly, we approximate $P^{-1}$ using the diagonal covariance of the mean-field variational posterior, yielding $\log\det(P)$ as the negative sum of the logarithms of the posterior variances.
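
The terms of (17) can be assembled as in the sketch below. It is an illustration under stated assumptions rather than the authors' code: $W$ is taken to be the number of trainable weights (it is not defined in this appendix), the residual sum is assumed to be precomputed at $w_{\mathrm{MAP}}$, and the function name `laplace_log_evidence` is hypothetical.

```python
import math


def laplace_log_evidence(sq_residual_sum, n_obs, w_map, post_var, sigma_noise, sigma_pr):
    """Evaluate the right-hand side of Eq. (17) with a diagonal Hessian approximation.

    sq_residual_sum: sum over training windows of the squared forecast residual norms at w_MAP.
    n_obs:           N_D * H, the total number of glucose observations.
    w_map:           list of W weights (means of the variational posterior).
    post_var:        list of W variational posterior variances (diagonal of P^{-1}).
    """
    n_weights = len(w_map)  # W, assumed to be the number of trainable weights
    log_lik = (-sq_residual_sum / (2.0 * sigma_noise ** 2)
               - 0.5 * n_obs * math.log(2.0 * math.pi * sigma_noise ** 2))
    log_prior = (-sum(w * w for w in w_map) / (2.0 * sigma_pr ** 2)
                 - 0.5 * n_weights * math.log(2.0 * math.pi * sigma_pr ** 2))
    # log det(P / 2 pi) = log det(P) - W log(2 pi), with log det(P) = -sum(log variances).
    log_det = -sum(math.log(v) for v in post_var) - n_weights * math.log(2.0 * math.pi)
    return log_lik + log_prior - 0.5 * log_det


# Sanity check on a one-weight linear-Gaussian toy model y = w + noise with y = 2 and
# sigma_noise = sigma_pr = 1: the posterior has mean 1 and variance 0.5.
toy = laplace_log_evidence(1.0, 1, [1.0], [0.5], 1.0, 1.0)
```

For this toy model the Laplace approximation is exact, so `toy` should agree with the log density of $\mathcal{N}(0,2)$ at $y=2$, namely $-1-\tfrac{1}{2}\log(4\pi)$. For a neural network the same expression is only an approximation.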

The evidence distribution in ([10](https://arxiv.org/html/2606.05373#S2.E10)) for architecture comparison under the Laplace approximation is obtained by marginalizing the inference hyperparameters around their MAP estimate:

$$
\pi_{\mathrm{evid}}(\mathcal{D}\mid\xi)\approx\pi_{\mathrm{evid}}(\mathcal{D}\mid\sigma_{\mathrm{MAP}},\xi)\,\pi_{\mathrm{pr}}(\sigma_{\mathrm{MAP}}\mid\xi)\,\Big[2\pi\sqrt{\mathrm{Var}(\sigma_{\mathrm{pr}}^{2})\,\mathrm{Var}(\sigma_{\mathrm{noise}}^{2})}\Big].
\tag{18}
$$

The variances in ([18](https://arxiv.org/html/2606.05373#A1.E18)) correspond to the Gaussian approximation of the posterior over the inference hyperparameters.
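
As a short worked step (an editorial sketch, assuming the Gaussian approximation over the two hyperparameters has a diagonal covariance $\Sigma$): the bracketed factor in (18) is the two-dimensional Laplace volume $(2\pi)^{d/2}\sqrt{\det\Sigma}$ with $d=2$ and $\det\Sigma=\mathrm{Var}(\sigma_{\mathrm{pr}}^{2})\,\mathrm{Var}(\sigma_{\mathrm{noise}}^{2})$. Taking logarithms gives a form that combines directly with (17) and avoids numerical underflow when architectures are compared:

$$
\log\pi_{\mathrm{evid}}(\mathcal{D}\mid\xi)\approx\log\pi_{\mathrm{evid}}(\mathcal{D}\mid\sigma_{\mathrm{MAP}},\xi)+\log\pi_{\mathrm{pr}}(\sigma_{\mathrm{MAP}}\mid\xi)+\log(2\pi)+\tfrac{1}{2}\Big[\log\mathrm{Var}(\sigma_{\mathrm{pr}}^{2})+\log\mathrm{Var}(\sigma_{\mathrm{noise}}^{2})\Big].
$$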

## Appendix B Single-patient data generation model

This appendix provides the simulator equations and parameter values used to generate the single-patient dataset in Figure [3](https://arxiv.org/html/2606.05373#S3.F3). The Bergman states are plasma glucose $G(t)$ [mg/dL], remote insulin action $X(t)$ [min$^{-1}$], and plasma insulin $I(t)$ [mU/L], with basal values $G_{b}$ and $I_{b}$. For a Type-1 diabetic (T1D) subject, endogenous pancreatic secretion is removed and insulin is administered externally. The stabilized T1D model is

$$
\dot{G}(t)=-\big(X(t)+p_{1}\big)\,G(t)+p_{1}G_{b}+\frac{R_{ag}(t)}{V_{g}},
\tag{19}
$$

$$
\dot{X}(t)=-p_{2}X(t)+p_{3}\big(I(t)-I_{b}\big),
\tag{20}
$$

$$
\dot{I}(t)=-p_{4}\big(I(t)-I_{b}\big)+U(t),
\tag{21}
$$

where $V_{g}$ [dL] is the glucose distribution volume and $U(t)$ is the insulin control input around basal. This form is obtained by writing the physical infusion as

$$
U_{0}(t)=U(t)+p_{4}I_{b},
\tag{22}
$$

so that $p_{4}I_{b}$ mimics the basal insulin dose required to prevent open-loop glucose drift in T1D. Meal absorption is modeled with stomach–gut compartments whose output is $R_{ag}(t)$. Let $q_{\mathrm{sto}1}(t)$ and $q_{\mathrm{sto}2}(t)$ [mg] denote stomach glucose in the solid and liquid phases, and let $q_{\mathrm{gut}}(t)$ [mg] denote intestinal glucose. A meal of size $D$ [mg] consumed at time $t_{m}$ is injected using a Dirac impulse:

$$
\dot{q}_{\mathrm{sto}1}(t)=-k_{21}\,q_{\mathrm{sto}1}(t)+D\,\delta(t-t_{m}),
\tag{23}
$$

$$
\dot{q}_{\mathrm{sto}2}(t)=-k_{\mathrm{empt}}\!\big(q_{\mathrm{sto}}(t)\big)\,q_{\mathrm{sto}2}(t)+k_{21}\,q_{\mathrm{sto}1}(t),
\tag{24}
$$

$$
\dot{q}_{\mathrm{gut}}(t)=-k_{\mathrm{abs}}\,q_{\mathrm{gut}}(t)+k_{\mathrm{empt}}\!\big(q_{\mathrm{sto}}(t)\big)\,q_{\mathrm{sto}2}(t),
\tag{25}
$$

$$
R_{ag}(t)=f\,k_{\mathrm{abs}}\,q_{\mathrm{gut}}(t),\qquad q_{\mathrm{sto}}(t)=q_{\mathrm{sto}1}(t)+q_{\mathrm{sto}2}(t),
\tag{26}
$$

where $k_{21}$ [min$^{-1}$] governs transfer from the solid to the liquid stomach phase, $k_{\mathrm{abs}}$ [min$^{-1}$] governs intestinal absorption, and $f$ is a dimensionless bioavailability factor. Gastric emptying is modeled as a smooth nonlinearity between $k_{\min}$ and $k_{\max}$:

$$
k_{\mathrm{empt}}(q_{\mathrm{sto}})=k_{\min}+\tfrac{1}{2}\big(k_{\max}-k_{\min}\big)\Big(\tanh\!\big[\alpha(q_{\mathrm{sto}}-bD)\big]-\tanh\!\big[\beta(q_{\mathrm{sto}}-cD)\big]+2\Big),
\tag{27}
$$

$$
\alpha=\frac{5}{2D(1-b)},\qquad\beta=\frac{5}{2Dc},
\tag{28}
$$

with $b,c\in(0,1)$ dimensionless shape parameters.
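
The right-hand side of (19)–(28) can be written compactly as in the sketch below. All numerical parameter values are hypothetical placeholders chosen for illustration (they are not the values of Table 3), and the function names are assumptions. The Dirac impulse in (23) is not part of the right-hand side; in a simulation it would be applied as a jump of size $D$ in $q_{\mathrm{sto}1}$ at $t_{m}$.

```python
import math


def k_empt(q_sto, D, k_min, k_max, b, c):
    """Gastric emptying rate, Eqs. (27)-(28)."""
    alpha = 5.0 / (2.0 * D * (1.0 - b))
    beta = 5.0 / (2.0 * D * c)
    shape = math.tanh(alpha * (q_sto - b * D)) - math.tanh(beta * (q_sto - c * D)) + 2.0
    return k_min + 0.5 * (k_max - k_min) * shape


def t1d_rhs(state, U, p):
    """Time derivatives of (G, X, I, q_sto1, q_sto2, q_gut) between meal impulses."""
    G, X, I, q1, q2, q_gut = state
    ke = k_empt(q1 + q2, p["D"], p["k_min"], p["k_max"], p["b"], p["c"])
    R_ag = p["f"] * p["k_abs"] * q_gut                               # Eq. (26)
    dG = -(X + p["p1"]) * G + p["p1"] * p["Gb"] + R_ag / p["Vg"]     # Eq. (19)
    dX = -p["p2"] * X + p["p3"] * (I - p["Ib"])                      # Eq. (20)
    dI = -p["p4"] * (I - p["Ib"]) + U                                # Eq. (21)
    dq1 = -p["k21"] * q1                                             # Eq. (23), impulse handled separately
    dq2 = -ke * q2 + p["k21"] * q1                                   # Eq. (24)
    dq_gut = -p["k_abs"] * q_gut + ke * q2                           # Eq. (25)
    return [dG, dX, dI, dq1, dq2, dq_gut]


# Hypothetical placeholder parameters, NOT the values used in the paper.
params = {"p1": 0.03, "p2": 0.02, "p3": 1e-5, "p4": 0.3, "Gb": 110.0, "Ib": 10.0,
          "Vg": 120.0, "k21": 0.05, "k_abs": 0.06, "k_min": 0.008, "k_max": 0.05,
          "f": 0.9, "b": 0.8, "c": 0.1, "D": 40000.0}

# With U = 0 and no meal in the system, the basal state should be an equilibrium.
basal_derivatives = t1d_rhs([params["Gb"], 0.0, params["Ib"], 0.0, 0.0, 0.0], 0.0, params)
```

Evaluating the right-hand side at the basal state with $U=0$ is a quick consistency check on the "control input around basal" form: every derivative vanishes, which reflects the role of the $p_{4}I_{b}$ term in (22).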

Table 3: Parameter values for the Type-1 diabetic patient used in simulation.

For creating the training data in Figure [3](https://arxiv.org/html/2606.05373#S3.F3), meal variability is introduced by perturbing both meal size and meal timing. For each meal, the carbohydrate amount is sampled as $D\sim\mathcal{U}(0.7D_{\mathrm{nom}},1.3D_{\mathrm{nom}})$, with $D_{\mathrm{nom}}\in\{20,40,60\}$ g for breakfast, lunch, and dinner, respectively. Meal timing is independently jittered by an offset sampled from $\mathcal{U}(-50,50)$ minutes.
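
The meal perturbation described above can be sketched as follows. The nominal meal times are hypothetical placeholders, since they are not given in this appendix, and the function name `sample_day_meals` is an assumption; only the nominal carbohydrate amounts and the two uniform ranges come from the text.

```python
import random

# Nominal carbohydrate amounts in grams, as stated in the text.
NOMINAL_CARBS_G = {"breakfast": 20.0, "lunch": 40.0, "dinner": 60.0}
# Hypothetical nominal meal times in minutes after midnight (not given in this appendix).
NOMINAL_TIME_MIN = {"breakfast": 480.0, "lunch": 780.0, "dinner": 1140.0}


def sample_day_meals(rng):
    """Return a list of (meal name, time in minutes, carbohydrates in grams) for one day."""
    meals = []
    for name, d_nom in NOMINAL_CARBS_G.items():
        carbs = rng.uniform(0.7 * d_nom, 1.3 * d_nom)            # D ~ U(0.7 D_nom, 1.3 D_nom)
        time = NOMINAL_TIME_MIN[name] + rng.uniform(-50.0, 50.0)  # independent timing jitter
        meals.append((name, time, carbs))
    return meals


# A seeded generator keeps the sampled day reproducible.
day = sample_day_meals(random.Random(0))
```

The sampled amounts are in grams, whereas the meal size $D$ in the absorption model is expressed in mg, so a conversion would be needed before the two are combined.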

## Appendix C Candidate TCN architecture pool

This appendix reports the implementation constraints and detailed architecture specifications used to construct the 61 candidate TCNs summarized in TableLABEL:tab:arch\_level. Kernel sizes are restricted to $f\leq 15$, which is sufficient to represent local glucose excursions without excessively large convolutional filters. Encoder channel widths are initialized with 2 filters in the first temporal block and increased by a factor of two across successive encoder blocks, up to a maximum of 512 filters, with a mirrored decoder schedule. Tables 4–6 summarize the Level 1–3 architectures explored by EVIDENT in the numerical experiments, and Table 7 lists the 19 architectures used in the random-search baseline.

Table 4: TCN architectures for Level 1. The dilation factor in encoder block $j$ is defined as $\delta_{j}=b^{j-1}$, where $b$ is the dilation base. The decoder channels $\{\tilde{c}_{j}\}_{j=1}^{N}$ mirror the encoder channels $\{c_{j}\}_{j=1}^{N}$ in reverse order.

| Architecture ID | Parameter count | Temporal blocks $N$ | Filter size $f$ | Dilation base $b$ | Encoder channels $\{c_{j}\}_{j=1}^{N}$ |
| --- | --- | --- | --- | --- | --- |
| $Level_{1}Arch_{1}$ | 2756 | 3 | 9 | 9 | $[2,4,8]$ |
| $Level_{1}Arch_{2}$ | 3356 | 3 | 11 | 8 | $[2,4,8]$ |
| $Level_{1}Arch_{3}$ | 3356 | 3 | 11 | 9 | $[2,4,8]$ |
| $Level_{1}Arch_{4}$ | 3956 | 3 | 13 | 8 | $[2,4,8]$ |
| $Level_{1}Arch_{5}$ | 8492 | 4 | 7 | 5 | $[2,4,8,16]$ |
| $Level_{1}Arch_{6}$ | 10696 | 3 | 9 | 9 | $[4,8,16]$ |
| $Level_{1}Arch_{7}$ | 13048 | 3 | 11 | 8 | $[4,8,16]$ |
| $Level_{1}Arch_{8}$ | 13048 | 3 | 11 | 9 | $[4,8,16]$ |
| $Level_{1}Arch_{9}$ | 13276 | 4 | 11 | 4 | $[2,4,8,16]$ |
| $Level_{1}Arch_{10}$ | 15400 | 3 | 13 | 8 | $[4,8,16]$ |

Table 5: TCN architectures for Level 2. The dilation factor in encoder block $j$ is defined as $\delta_{j}=b^{j-1}$, where $b$ is the dilation base. The decoder channels $\{\tilde{c}_{j}\}_{j=1}^{N}$ mirror the encoder channels $\{c_{j}\}_{j=1}^{N}$ in reverse order. EVIDENT identifies $Level_{2}Arch_{4}$ (bold) as the trustworthy predictor.

| Architecture ID | Parameter count | Temporal blocks $N$ | Filter size $f$ | Dilation base $b$ | Encoder channels $\{c_{j}\}_{j=1}^{N}$ |
| --- | --- | --- | --- | --- | --- |
| $Level_{2}Arch_{1}$ | 33560 | 4 | 7 | 5 | $[4,8,16,32]$ |
| $Level_{2}Arch_{2}$ | 33708 | 5 | 7 | 3 | $[2,4,8,16,32]$ |
| $Level_{2}Arch_{3}$ | 42128 | 3 | 9 | 9 | $[8,16,32]$ |
| $\mathbf{Level_{2}Arch_{4}}$ | 43268 | 5 | 9 | 3 | $[2,4,8,16,32]$ |
| $Level_{2}Arch_{5}$ | 51440 | 3 | 11 | 8 | $[8,16,32]$ |
| $Level_{2}Arch_{6}$ | 51440 | 3 | 11 | 9 | $[8,16,32]$ |
| $Level_{2}Arch_{7}$ | 52600 | 4 | 11 | 4 | $[4,8,16,32]$ |
| $Level_{2}Arch_{8}$ | 57852 | 6 | 3 | 3 | $[2,4,8,16,32,64]$ |
| $Level_{2}Arch_{9}$ | 60752 | 3 | 13 | 8 | $[8,16,32]$ |

Table 6: TCN architectures for Level 3. The dilation factor in encoder block $j$ is defined as $\delta_{j}=b^{j-1}$, where $b$ is the dilation base. The decoder channels $\{\tilde{c}_{j}\}_{j=1}^{N}$ mirror the encoder channels $\{c_{j}\}_{j=1}^{N}$ in reverse order.

| Architecture ID | Parameter count | Temporal blocks $N$ | Filter size $f$ | Dilation base $b$ | Encoder channels $\{c_{j}\}_{j=1}^{N}$ |
| --- | --- | --- | --- | --- | --- |
| $Level_{3}Arch_{1}$ | 133424 | 4 | 7 | 5 | $[8,16,32,64]$ |
| $Level_{3}Arch_{2}$ | 134168 | 5 | 7 | 3 | $[4,8,16,32,64]$ |
| $Level_{3}Arch_{3}$ | 167200 | 3 | 9 | 9 | $[16,32,64]$ |
| $Level_{3}Arch_{4}$ | 172360 | 5 | 9 | 3 | $[4,8,16,32,64]$ |
| $Level_{3}Arch_{5}$ | 204256 | 3 | 11 | 8 | $[16,32,64]$ |
| $Level_{3}Arch_{6}$ | 204256 | 3 | 11 | 9 | $[16,32,64]$ |
| $Level_{3}Arch_{7}$ | 209392 | 4 | 11 | 4 | $[8,16,32,64]$ |
| $Level_{3}Arch_{8}$ | 230328 | 6 | 3 | 3 | $[4,8,16,32,64,128]$ |
| $Level_{3}Arch_{9}$ | 241312 | 3 | 13 | 8 | $[16,32,64]$ |
| $Level_{3}Arch_{10}$ | 249012 | 6 | 13 | 2 | $[2,4,8,16,32,64]$ |

Table 7: Random-search baseline architecture subset. The 19 architectures were sampled uniformly without replacement from the same feasible pool of 61 TCN candidates used by EVIDENT. The architecture selected by random search is shown in bold.

| Random-search architecture ID | EVIDENT ID | Parameter count | Temporal blocks $N$ | Filter size $f$ | Dilation base $b$ | Encoder channels $\{c_{j}\}_{j=1}^{N}$ |
| --- | --- | --- | --- | --- | --- | --- |
| $RandArch_{1}$ | $Level_{1}Arch_{1}$ | 2756 | 3 | 9 | 9 | $[2,4,8]$ |
| $RandArch_{2}$ | $Level_{1}Arch_{3}$ | 3356 | 3 | 11 | 9 | $[2,4,8]$ |
| $RandArch_{3}$ | $Level_{1}Arch_{4}$ | 3956 | 3 | 13 | 8 | $[2,4,8]$ |
| $RandArch_{4}$ | $Level_{1}Arch_{9}$ | 13276 | 4 | 11 | 4 | $[2,4,8,16]$ |
| $RandArch_{5}$ | $Level_{2}Arch_{1}$ | 33560 | 4 | 7 | 5 | $[4,8,16,32]$ |
| $RandArch_{6}$ | $Level_{2}Arch_{4}$ | 43268 | 5 | 9 | 3 | $[2,4,8,16,32]$ |
| $RandArch_{7}$ | $Level_{2}Arch_{8}$ | 57852 | 6 | 3 | 3 | $[2,4,8,16,32,64]$ |
| $\mathbf{RandArch_{8}}$ | $\mathbf{Level_{3}Arch_{1}}$ | 133424 | 4 | 7 | 5 | $[8,16,32,64]$ |
| $RandArch_{9}$ | $Level_{3}Arch_{4}$ | 172360 | 5 | 9 | 3 | $[4,8,16,32,64]$ |
| $RandArch_{10}$ | $Level_{3}Arch_{5}$ | 204256 | 3 | 11 | 8 | $[16,32,64]$ |
| $RandArch_{11}$ | $Level_{3}Arch_{8}$ | 230328 | 6 | 3 | 3 | $[4,8,16,32,64,128]$ |
| $RandArch_{12}$ | $Level_{4}Arch_{1}$ | 532064 | 4 | 7 | 5 | $[16,32,64,128]$ |
| $RandArch_{13}$ | $Level_{4}Arch_{4}$ | 666176 | 3 | 9 | 9 | $[32,64,128]$ |
| $RandArch_{14}$ | $Level_{4}Arch_{5}$ | 688016 | 5 | 9 | 3 | $[8,16,32,64,128]$ |
| $RandArch_{15}$ | $Level_{5}Arch_{1}$ | 2124992 | 4 | 7 | 5 | $[32,64,128,256]$ |
| $RandArch_{16}$ | $Level_{5}Arch_{5}$ | 2749216 | 5 | 9 | 3 | $[16,32,64,128,256]$ |
| $RandArch_{17}$ | $Level_{5}Arch_{6}$ | 3250048 | 3 | 11 | 8 | $[64,128,256]$ |
| $RandArch_{18}$ | $Level_{6}Arch_{4}$ | 10627328 | 3 | 9 | 9 | $[128,256,512]$ |
| $RandArch_{19}$ | $Level_{6}Arch_{10}$ | 15899808 | 6 | 13 | 2 | $[16,32,64,128,256,512]$ |

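
The receptive fields quoted in the caption of Figure 20 can be related to the tabulated $N$, $f$, and $b$ with a short calculation. The sketch below assumes the common TCN residual-block layout with two dilated convolutions per temporal block and dilation $b^{j-1}$ in block $j$; the two-convolution assumption is not stated in this appendix, but it reproduces both reported values. The function name `receptive_field` is hypothetical.

```python
def receptive_field(n_blocks, filter_size, dilation_base, convs_per_block=2):
    """Receptive field (in time steps) of a stack of dilated temporal blocks.

    Each convolution with kernel size f and dilation d extends the receptive field
    by (f - 1) * d; block j uses dilation b**(j - 1), as in the table captions.
    convs_per_block=2 is an assumption about the block layout.
    """
    dilations = [dilation_base ** (j - 1) for j in range(1, n_blocks + 1)]
    return 1 + convs_per_block * (filter_size - 1) * sum(dilations)


rf_rand8 = receptive_field(n_blocks=4, filter_size=7, dilation_base=5)    # RandArch_8: 1873
rf_level2 = receptive_field(n_blocks=5, filter_size=9, dilation_base=3)   # Level_2 Arch_4: 1937
```

Under this assumption the two architectures cover nearly the same history length, while their parameter counts differ by roughly a factor of three because of the channel widths.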
## References

- \[1\]D\. Zhang, Z\. Zhang, N\. Chen, Y\. Wang,[Rfnet: Multivariate long sequence time\-series forecasting based on recurrent representation and feature enhancement](https://www.sciencedirect.com/science/article/pii/S089360802400724X), Neural Networks 181 \(2025\) 106800\.[doi:10\.1016/j\.neunet\.2024\.106800](https://doi.org/10.1016/j.neunet.2024.106800)\.
- \[2\]S\. Lucas, E\. Portillo,[Methodology based on spiking neural networks for univariate time\-series forecasting](https://www.sciencedirect.com/science/article/pii/S0893608024000959), Neural Networks 173 \(2024\) 106171\.[doi:10\.1016/j\.neunet\.2024\.106171](https://doi.org/10.1016/j.neunet.2024.106171)\.
- \[3\]A\. Hu, L\. Wen, Y\. Dai, S\. Qi, J\. Wang, Z\. Chen, X\. Zhou, D\. Wang, Z\. Xu, J\. Duan,[Timecnn: Refining cross\-variable interaction on time point for time series forecasting](https://www.sciencedirect.com/science/article/pii/S0893608025011931), Neural Networks 196 \(2026\) 108312\.[doi:10\.1016/j\.neunet\.2025\.108312](https://doi.org/10.1016/j.neunet.2025.108312)\.
- \[4\]T\. Elsken, J\. H\. Metzen, F\. Hutter, Neural architecture search: A survey, The Journal of Machine Learning Research 20 \(1\) \(2019\) 1997–2017\.
- \[5\]B\. Wang, Y\. Sun, B\. Xue, M\. Zhang, A hybrid differential evolution approach to designing deep convolutional neural networks for image classification, in: AI 2018: Advances in Artificial Intelligence: 31st Australasian Joint Conference, Wellington, New Zealand, December 11\-14, 2018, Proceedings 31, Springer, 2018, pp\. 237–250\.
- \[6\]A\. Ghosh, N\. D\. Jana, S\. Mallik, Z\. Zhao, Designing optimal convolutional neural network architecture using differential evolution algorithm, Patterns 3 \(9\) \(2022\) 100567\.
- \[7\]T\. Akiba, S\. Sano, T\. Yanase, T\. Ohta, M\. Koyama, Optuna: A next\-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019\.
- \[8\]R\. Al\-Sabri, J\. Gao, J\. Chen, B\. M\. Oloulade, Z\. Wu,[Autoams: Automated attention\-based multi\-modal graph learning architecture search](https://www.sciencedirect.com/science/article/pii/S0893608024003514), Neural Networks 179 \(2024\) 106427\.[doi:10\.1016/j\.neunet\.2024\.106427](https://doi.org/10.1016/j.neunet.2024.106427)\.
- \[9\]M\. Yaseen, X\. Wu, Quantification of deep neural network prediction uncertainties for vvuq of machine learning models, Nuclear Science and Engineering 197 \(5\) \(2023\) 947–966\.
- \[10\]J\. M\. Twomey, A\. E\. Smith, Validation and verification, Artificial neural networks for civil engineers: Fundamentals and applications \(1997\) 44–64\.
- \[11\]A\. Arzani, L\. Yuan, P\. Newell, B\. Wang, Interpreting and generalizing deep learning in physics\-based problems with functional linear models, arXiv preprint arXiv:2307\.04569 \(2023\)\.
- \[12\]W\. Samek, G\. Montavon, S\. Lapuschkin, C\. J\. Anders, K\.\-R\. Müller, Explaining deep neural networks and beyond: A review of methods and applications, Proceedings of the IEEE 109 \(3\) \(2021\) 247–278\.
- \[13\]X\. Zhong, B\. Gallagher, S\. Liu, B\. Kailkhura, A\. Hiszpanski, T\. Y\.\-J\. Han, Explainable machine learning in materials science, npj Computational Materials 8 \(1\) \(2022\) 204\.
- \[14\]P\. Li, Q\. Hu, X\. Wang, Federated learning meets bayesian neural network: Robust and uncertainty\-aware distributed variational inference, Neural Networks 185 \(2025\) 107135\.
- \[15\]S\. Jantre, S\. Bhattacharya, T\. Maiti, Layer adaptive node selection in bayesian neural networks: statistical guarantees and implementation details, arXiv preprint arXiv:2108\.11000 \(2021\)\.
- \[16\]D\. J\. MacKay, Probable networks and plausible predictions\-a review of practical bayesian methods for supervised neural networks, Network: computation in neural systems 6 \(3\) \(1995\) 469\.
- \[17\]M\. A\. Islam, D\. S\. Deighan, D\. Faghihi, Predicting microstructure\-property of silica aerogel materials via bayesian convolutional neural networks surrogate model, in: ASME International Mechanical Engineering Congress and Exposition, Vol\. 88681, American Society of Mechanical Engineers, 2024, p\. V010T12A016\.
- \[18\]C\. Sevilla\-Salcedo, A\. Gallardo\-Antolín, V\. Gómez\-Verdejo, E\. Parrado\-Hernández, Bayesian learning of feature spaces for multitask regression, Neural Networks 179 \(2024\) 106619\.
- \[19\]A\. Immer, M\. Bauer, V\. Fortuin, G\. Rätsch, K\. M\. Emtiyaz, Scalable marginal likelihood estimation for model selection in deep learning, in: International Conference on Machine Learning, PMLR, 2021, pp\. 4563–4573\.
- \[20\]J\. Tan, B\. Liang, P\. K\. Singh, K\. A\. Farrell\-Maupin, D\. Faghihi, Toward selecting optimal predictive multiscale models, Computer Methods in Applied Mechanics and Engineering 402 \(2022\) 115517\.
- \[21\]P\. K\. Singh, K\. A\. Farrell\-Maupin, D\. Faghihi, A framework for strategic discovery of credible neural network surrogate models under uncertainty, Computer Methods in Applied Mechanics and Engineering 427 \(2024\) 117061\.
- \[22\]J\. T\. Oden, I\. Babuška, D\. Faghihi, Predictive computational science: Computer predictions in the presence of uncertainty, Encyclopedia of Computational Mechanics Second Edition \(2017\) 1–26\.
- \[23\]S\. Bai, J\. Z\. Kolter, V\. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling, arXiv preprint arXiv:1803\.01271 \(2018\)\.
- \[24\]S\. M\. A\. Zaidi, V\. Chandola, M\. Ibrahim, B\. Romanski, L\. D\. Mastrandrea, T\. Singh, Multi\-step ahead predictive model for blood glucose concentrations of type\-1 diabetic patients, Scientific Reports 11 \(1\) \(2021\) 24332\.
- \[25\]K\. He, X\. Zhang, S\. Ren, J\. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp\. 770–778\.
- \[26\]A\. Paszke, S\. Gross, F\. Massa, A\. Lerer, J\. Bradbury, G\. Chanan, T\. Killeen, Z\. Lin, N\. Gimelshein, L\. Antiga, et al\., Pytorch: An imperative style, high\-performance deep learning library, Advances in neural information processing systems 32 \(2019\)\.
- \[27\]K\. Shridhar, F\. Laumann, M\. Liwicki, A comprehensive guide to bayesian convolutional neural network with variational inference, arXiv preprint arXiv:1901\.02731 \(2019\)\.
- \[28\]D\. Deighan,[Model agnostic mfvi bnn](https://doi.org/10.5281/zenodo.20044677) \(May 2026\)\.[doi:10\.5281/zenodo\.20044677](https://doi.org/10.5281/zenodo.20044677)\.
- \[29\]R\. N\. Bergman, L\. S\. Phillips, C\. Cobelli, et al\., Physiologic evaluation of factors controlling glucose tolerance in man: measurement of insulin sensitivity and beta\-cell glucose sensitivity from the response to intravenous glucose, The Journal of Clinical Investigation 68 \(6\) \(1981\) 1456–1467\.
- \[30\]C\. Dalla Man, M\. Camilleri, C\. Cobelli, A system model of oral glucose absorption: validation on gold standard data, IEEE Transactions on Biomedical Engineering 53 \(12\) \(2006\) 2472–2478\.
- \[31\]C\. D\. Man, F\. Micheletto, D\. Lv, M\. Breton, B\. Kovatchev, C\. Cobelli, The UVA/Padova type 1 diabetes simulator: new features, Journal of Diabetes Science and Technology 8 \(1\) \(2014\) 26–34\.
- \[32\]J\. L\. Parkes, S\. L\. Slatin, S\. Pardo, B\. H\. Ginsberg, A new consensus error grid to evaluate the clinical significance of inaccuracies in the measurement of blood glucose, Diabetes Care 23 \(8\) \(2000\) 1143–1148\.
- \[33\]A\. Pfützner, D\. C\. Klonoff, S\. Pardo, J\. L\. Parkes, Technical aspects of the Parkes error grid, Journal of Diabetes Science and Technology 7 \(5\) \(2013\) 1275–1281\.
- \[34\]J\. Bergstra, Y\. Bengio, Random search for hyper\-parameter optimization, Journal of Machine Learning Research 13 \(2\) \(2012\)\.
- \[35\]A\. F\. Psaros, X\. Meng, Z\. Zou, L\. Guo, G\. E\. Karniadakis, Uncertainty quantification in scientific machine learning: Methods, metrics, and comparisons, Journal of Computational Physics 477 \(2023\) 111902\.