Neural Bayesian Sequential Routing
Summary
Introduces Neural Bayesian Sequential Routing (NBSR), a framework that models neural inference as sequential evidence accumulation over a DAG using Dirichlet-Categorical conjugate updates, enabling uncertainty quantification, early exiting, and resource-rational inference.
# Neural Bayesian Sequential Routing
Source: [https://arxiv.org/html/2605.26147](https://arxiv.org/html/2605.26147)
Yongchao Huang (Email: yongchao.huang@abdn.ac.uk). The author welcomes any follow-up work, extensions, and adaptations of these ideas. If this manuscript is found useful in future research, appropriate citation would be appreciated. It was developed over many days and nights with the aim of providing self-contained material for open knowledge sharing, although some (many) errors may still remain after careful review.
\(28/03/2026\)
###### Abstract
Human decision-making is inherently sequential and uncertainty-aware, yet standard deep neural networks often rely on static, dense forward computation that provides limited visibility into how evidence is acquired, how uncertainty evolves, or when computation should stop. While conditional architectures such as Mixture-of-Experts (MoE) introduce input-dependent computation, traditional soft-routing mechanisms can suffer from expert imbalance or collapse, and typically do not maintain a temporally evolving belief state. To bridge this gap, we introduce Neural Bayesian Sequential Routing (NBSR), a dynamic framework that models neural inference as an active, evidence-accumulating traversal over a hierarchical Directed Acyclic Graph (DAG). Operating within a Dirichlet-Categorical conjugate framework, specialized neural experts query a persistent global knowledge oracle to extract strictly positive evidence vectors, which act as pseudo-counts and update a Dirichlet belief state via exact conjugate addition. By coupling this Bayesian belief update with a Gumbel-Softmax Straight-Through estimator, NBSR enables hard, path-dependent routing while preserving surrogate gradients for end-to-end training. The resulting Dirichlet precision and entropy provide native mechanisms for uncertainty quantification, entropy-based early exiting, OOD abstention, and cost-aware evidence acquisition. We provide theoretical guarantees showing that, under strictly positive evidence extraction, total Dirichlet precision increases monotonically along any valid trajectory and the marginal predictive variance is correspondingly bounded, formalizing the intended sequential “hypothesis sharpening” behavior; under idealized capacity and optimization assumptions, the terminal Dirichlet expectation recovers the Bayes-optimal conditional distribution.
Empirical evaluations on visual categorization, structured medical diagnosis, language modeling, partially observable control, and cost\-aware Bayesian experimental design show that NBSR achieves competitive predictive performance while yielding transparent routing traces, path\-dependent evidence attribution, uncertainty\-aware decision control, and resource\-rational inference\. Ultimately, NBSR provides a mathematically grounded framework for interpretable, modular, and resource\-rational agentic AI\.
###### Contents
1. [1 Introduction](https://arxiv.org/html/2605.26147#S1)
2. [2 Related Works](https://arxiv.org/html/2605.26147#S2)
3. [3 Preliminaries](https://arxiv.org/html/2605.26147#S3)
4. [4 Methodology](https://arxiv.org/html/2605.26147#S4)
   1. [4.1 Pipeline Overview](https://arxiv.org/html/2605.26147#S4.SS1)
   2. [4.2 The Decision Graph](https://arxiv.org/html/2605.26147#S4.SS2)
   3. [4.3 Global Feature Representation: The Knowledge Oracle](https://arxiv.org/html/2605.26147#S4.SS3)
   4. [4.4 The Bayesian State and Initialization](https://arxiv.org/html/2605.26147#S4.SS4)
   5. [4.5 The Differentiable Router Network](https://arxiv.org/html/2605.26147#S4.SS5)
   6. [4.6 Evidence Extraction and Belief Update](https://arxiv.org/html/2605.26147#S4.SS6)
   7. [4.7 Training Objective and Dynamics](https://arxiv.org/html/2605.26147#S4.SS7)
   8. [4.8 Inference, Early Exiting, and Epistemic Abstention](https://arxiv.org/html/2605.26147#S4.SS8)
   9. [4.9 Training the NBSR Tree](https://arxiv.org/html/2605.26147#S4.SS9)
5. [5 Theoretical Analysis](https://arxiv.org/html/2605.26147#S5)
   1. [5.1 Monotonicity of Precision and Variance Reduction](https://arxiv.org/html/2605.26147#S5.SS1)
   2. [5.2 Asymptotic Consistency of the Belief State](https://arxiv.org/html/2605.26147#S5.SS2)
   3. [5.3 Hyperparameter Dynamics and Information Acquisition](https://arxiv.org/html/2605.26147#S5.SS3)
   4. [5.4 Topological Bias-Variance Trade-off](https://arxiv.org/html/2605.26147#S5.SS4)
6. [6 Experiments](https://arxiv.org/html/2605.26147#S6)
   Footnote 19: The experimental Python code was developed largely with the kind assistance of Gemini 3.0 \[24\], which the author gratefully acknowledges.
   1. [6.1 A Toy Experiment: Sequential Belief Sharpening](https://arxiv.org/html/2605.26147#S6.SS1)
   2. [6.2 Visual Categorization: CIFAR-10](https://arxiv.org/html/2605.26147#S6.SS2)
      1. [6.2.1 Experimental Setup and Baselines](https://arxiv.org/html/2605.26147#S6.SS2.SSS1)
      2. [6.2.2 Results and Analysis](https://arxiv.org/html/2605.26147#S6.SS2.SSS2)
   3. [6.3 Structured Medical Diagnosis](https://arxiv.org/html/2605.26147#S6.SS3)
      Footnote 27: In this context, “structured” refers to tabular data characterized by predefined, discrete feature columns (e.g. binary symptom indicators), in direct contrast to the unstructured spatial manifolds (e.g. raw image pixels) evaluated in the previous visual categorization task.
      1. [6.3.1 Experimental Setup and Baselines](https://arxiv.org/html/2605.26147#S6.SS3.SSS1)
      2. [6.3.2 Results and Analysis](https://arxiv.org/html/2605.26147#S6.SS3.SSS2)
   4. [6.4 Language Modelling: Interpretable and Uncertainty-Aware Next-Token Prediction](https://arxiv.org/html/2605.26147#S6.SS4)
      1. [6.4.1 Experimental Setup and Baselines](https://arxiv.org/html/2605.26147#S6.SS4.SSS1)
      2. [6.4.2 Results and Analysis](https://arxiv.org/html/2605.26147#S6.SS4.SSS2)
   5. [6.5 NBSR-Mem: NBSR with Dynamic Memory for Control and Planning](https://arxiv.org/html/2605.26147#S6.SS5)
      1. [6.5.1 Task and Experimental Setup.](https://arxiv.org/html/2605.26147#S6.SS5.SSS1)
      2. [6.5.2 Results and Analysis](https://arxiv.org/html/2605.26147#S6.SS5.SSS2)
   6. [6.6 NBSR as Active Learning in Bayesian Optimal Experimental Design](https://arxiv.org/html/2605.26147#S6.SS6)
      1. [6.6.1 Task and Experimental Setup: Active Clinical Triage](https://arxiv.org/html/2605.26147#S6.SS6.SSS1)
      2. [6.6.2 Results and Analysis](https://arxiv.org/html/2605.26147#S6.SS6.SSS2)
7. [7 Discussion](https://arxiv.org/html/2605.26147#S7)
   1. [7.1 Active Knowledge Retrieval vs. Information Bottlenecking](https://arxiv.org/html/2605.26147#S7.SS1)
   2. [7.2 Hierarchical Epistemic Capacity and Safe Inference](https://arxiv.org/html/2605.26147#S7.SS2)
   3. [7.3 Decoupling of Training and Inference](https://arxiv.org/html/2605.26147#S7.SS3)
   4. [7.4 NBSR as a Markov Decision Process (MDP)](https://arxiv.org/html/2605.26147#S7.SS4)
   5. [7.5 Modular Skill Acquisition and Unbounded Topologies](https://arxiv.org/html/2605.26147#S7.SS5)
8. [8 Conclusion](https://arxiv.org/html/2605.26147#S8)
9. [References](https://arxiv.org/html/2605.26147#bib)
10. [A The Dirichlet Distribution](https://arxiv.org/html/2605.26147#A1)
    1. [A.1 Definition and Support](https://arxiv.org/html/2605.26147#A1.SS1)
    2. [A.2 Conjugacy to the Categorical Distribution](https://arxiv.org/html/2605.26147#A1.SS2)
    3. [A.3 Expectation, Variance, and Covariance (The “Sharpening” Effect)](https://arxiv.org/html/2605.26147#A1.SS3)
    4. [A.4 Marginal Distributions](https://arxiv.org/html/2605.26147#A1.SS4)
    5. [A.5 Epistemic Uncertainty and Subjective Logic](https://arxiv.org/html/2605.26147#A1.SS5)
    6. [A.6 Differential Entropy and Uncertainty Reduction](https://arxiv.org/html/2605.26147#A1.SS6)
    7. [A.7 Kullback-Leibler Divergence between Two Dirichlet Distributions](https://arxiv.org/html/2605.26147#A1.SS7)
11. [B The Gumbel Distribution](https://arxiv.org/html/2605.26147#A2)
    1. [B.1 Definition and Support](https://arxiv.org/html/2605.26147#A2.SS1)
    2. [B.2 Mean and Variance](https://arxiv.org/html/2605.26147#A2.SS2)
    3. [B.3 Extreme Value Theory and the Gumbel-Max Trick](https://arxiv.org/html/2605.26147#A2.SS3)
12. [C The Gumbel-Softmax Continuous Relaxation](https://arxiv.org/html/2605.26147#A3)
    1. [C.1 Continuous Approximation of Argmax](https://arxiv.org/html/2605.26147#A3.SS1)
    2. [C.2 Temperature Annealing](https://arxiv.org/html/2605.26147#A3.SS2)
    3. [C.3 The Straight-Through Estimator (STE)](https://arxiv.org/html/2605.26147#A3.SS3)
13. [D Derivation of the Generalized Bias-Variance Decomposition](https://arxiv.org/html/2605.26147#A4)
14. [E Difference Between NBSR and Classical Decision Trees](https://arxiv.org/html/2605.26147#A5)
15. [F Further Results for CIFAR-10 Classification](https://arxiv.org/html/2605.26147#A6)
    1. [F.1 Computing Environment and Experimental Setup](https://arxiv.org/html/2605.26147#A6.SS1)
    2. [F.2 Evaluation Metric: Expected Calibration Error (ECE)](https://arxiv.org/html/2605.26147#A6.SS2)
    3. [F.3 Baseline Training Dynamics](https://arxiv.org/html/2605.26147#A6.SS3)
16. [G Experimental Details: Structured Medical Diagnosis](https://arxiv.org/html/2605.26147#A7)
    1. [G.1 Computing Environment](https://arxiv.org/html/2605.26147#A7.SS1)
    2. [G.2 Dataset and Preprocessing](https://arxiv.org/html/2605.26147#A7.SS2)
    3. [G.3 Network Architectures](https://arxiv.org/html/2605.26147#A7.SS3)
    4. [G.4 Optimization and Hyperparameters](https://arxiv.org/html/2605.26147#A7.SS4)
17. [H Experimental Details: Language Modeling](https://arxiv.org/html/2605.26147#A8)
    1. [H.1 Computing Environment](https://arxiv.org/html/2605.26147#A8.SS1)
    2. [H.2 Dataset and Preprocessing](https://arxiv.org/html/2605.26147#A8.SS2)
    3. [H.3 Network Architectures](https://arxiv.org/html/2605.26147#A8.SS3)
    4. [H.4 Optimization and Hyperparameters](https://arxiv.org/html/2605.26147#A8.SS4)
18. [I Experimental Setup for the POMDP Navigation Task](https://arxiv.org/html/2605.26147#A9)
    1. [I.1 Computing Environment](https://arxiv.org/html/2605.26147#A9.SS1)
    2. [I.2 POMDP Dataset Generation](https://arxiv.org/html/2605.26147#A9.SS2)
    3. [I.3 Model Architectures](https://arxiv.org/html/2605.26147#A9.SS3)
    4. [I.4 Training Protocol and Hyperparameters](https://arxiv.org/html/2605.26147#A9.SS4)
19. [J Experimental Details for BOED Active Clinical Triage](https://arxiv.org/html/2605.26147#A10)
    1. [J.1 Computing Environment](https://arxiv.org/html/2605.26147#A10.SS1)
    2. [J.2 Dataset and Preprocessing](https://arxiv.org/html/2605.26147#A10.SS2)
    3. [J.3 Network Architectures](https://arxiv.org/html/2605.26147#A10.SS3)
    4. [J.4 Optimization and Training Dynamics](https://arxiv.org/html/2605.26147#A10.SS4)
## 1 Introduction
Human decision\-making is naturally sequential, distributional, and hierarchical\[[46](https://arxiv.org/html/2605.26147#bib.bib59),[73](https://arxiv.org/html/2605.26147#bib.bib60)\]\. When navigating complex environments or diagnosing intricate problems, humans do not evaluate all possible information simultaneously; rather, we selectively accumulate evidence step\-by\-step\[[6](https://arxiv.org/html/2605.26147#bib.bib61)\]\. Decisions are made at discrete junctures conditioned on both the current contextual state and the trajectory of prior outcomes\. As information flows through this mental decision tree, epistemic variance shrinks, and our broad initial hypotheses “sharpen” into narrow, confident conclusions\.
In contrast, standard deep learning paradigms typically rely on “flat”, monolithic architectures that process all input features simultaneously to produce a dense probability distribution over all possible decisions \[[49](https://arxiv.org/html/2605.26147#bib.bib62),[28](https://arxiv.org/html/2605.26147#bib.bib71)\]. While highly effective at maximizing semantic accuracy, these approaches lack cognitive plausibility \[[9](https://arxiv.org/html/2605.26147#bib.bib63)\], exhibit poor interpretability in decision-auditing scenarios \[[51](https://arxiv.org/html/2605.26147#bib.bib64),[67](https://arxiv.org/html/2605.26147#bib.bib65)\], and often spend unnecessary computational resources on easily classified, unambiguous inputs \[[25](https://arxiv.org/html/2605.26147#bib.bib66),[27](https://arxiv.org/html/2605.26147#bib.bib67)\].
To introduce conditional computation, architectures such as Mixture of Experts (MoE) \[[39](https://arxiv.org/html/2605.26147#bib.bib68),[70](https://arxiv.org/html/2605.26147#bib.bib22)\] utilize dynamic routing to delegate inputs to specialized sub-networks. However, these systems inherently fall short of true sequential reasoning. Standard MoE networks typically rely on soft, continuous weighting across parallel paths, which diminishes the computational benefits of strict conditional execution \[[16](https://arxiv.org/html/2605.26147#bib.bib26)\]. Further, MoE routing is typically a static, single-step, input-conditional operation (footnote 2); it does not maintain a belief state, nor does it allow the network to sequentially accumulate evidence or re-evaluate its uncertainty.
Footnote 2: Once trained, the weights of a standard MoE router are frozen. It does not maintain a memory, it does not update an internal belief state, and it does not sequentially accumulate evidence across time or depth. It simply performs a flat, input-conditional matrix multiplication to partition a batch.
To bridge this gap, we introduce Neural Bayesian Sequential Routing (NBSR). This approach employs a Bayesian hierarchical decision graph: a novel neural framework that models complex decision-making as an active, evidence-accumulating routing process. We formulate the decision structure as a Directed Acyclic Graph (DAG) where each node contains a differentiable routing mechanism and a neural evidence extractor. Operating within a Dirichlet-Categorical conjugate framework \[[21](https://arxiv.org/html/2605.26147#bib.bib57),[4](https://arxiv.org/html/2605.26147#bib.bib69)\], the model maintains a persistent belief state over the final outcome space. At each routing step, local neural experts actively query the global data to extract strictly positive evidence vectors. These vectors act as Bayesian pseudo-counts, deterministically updating the Dirichlet concentration parameters and natively mirroring the human cognitive process of dynamic uncertainty reduction.
To enable end-to-end training of this discrete tree structure, we employ a Gumbel-Softmax relaxation \[[40](https://arxiv.org/html/2605.26147#bib.bib9),[53](https://arxiv.org/html/2605.26147#bib.bib10)\] coupled with the Straight-Through Estimator (STE) \[[3](https://arxiv.org/html/2605.26147#bib.bib11)\]. This allows for hard, path-dependent routing during inference (which drastically reduces FLOPs) while maintaining smooth surrogate gradient flow to the routers during backpropagation. Furthermore, because NBSR natively tracks epistemic uncertainty, it naturally accommodates extensions into autonomous planning via recurrent memory (POMDP navigation) \[[45](https://arxiv.org/html/2605.26147#bib.bib70)\] and resource-rational active learning (Bayesian Optimal Experimental Design) \[[68](https://arxiv.org/html/2605.26147#bib.bib51),[18](https://arxiv.org/html/2605.26147#bib.bib50)\].
We empirically validate the efficacy, interpretability, and computational efficiency of the NBSR framework across five diverse domains: (1) visual object categorization (CIFAR-10); (2) structured medical diagnosis, yielding personalized feature attribution; (3) language modeling via interpretable syntactic-to-semantic token routing; (4) partially observable control (POMDPs) utilizing a recurrent memory state; and (5) active clinical triage modeled as a resource-rational BOED agent. Our core contributions are:
1. A Bayesian Sequential Routing Framework: We propose a novel formulation for conditional neural execution where discrete routing decisions sequentially update a Dirichlet belief state via exact conjugate addition, mathematically enforcing the progressive sharpening of decision boundaries.
2. End-to-End Hard Routing: We adapt the Gumbel-Softmax estimator to train discrete hierarchical decision trees over deep feature extractors, naturally preventing representation collapse while enabling dynamic, entropy-based early exiting to minimize inference costs.
3. Theoretical Foundations: We provide formal theoretical guarantees for the NBSR framework, proving the strict monotonicity of precision accumulation, bounding the variance reduction, and establishing the asymptotic consistency of the learning objective.
4. Interpretable and Safe Inference: We demonstrate that our framework provides fully transparent, causal audit trails for every decision, whilst utilizing its native Dirichlet precision metric to successfully trigger out-of-distribution (OOD) safety abstentions \[[29](https://arxiv.org/html/2605.26147#bib.bib40),[69](https://arxiv.org/html/2605.26147#bib.bib28)\] and prevent hallucinations.
5. Resource-Rational Extensions: We successfully extend the core DAG topology to incorporate temporal memory buffers (NBSR-Mem) for autonomous control, and formulate it as an active Autoregressive State Machine (AR-NBSR) capable of executing long-horizon, budget-constrained Bayesian Optimal Experimental Design.
## 2 Related Works
##### Bayesian Belief Updating
Standard deep learning architectures primarily output deterministic point estimates, which fail to capture the underlying uncertainty of the data distribution. To address this, Bayesian Neural Networks (BNNs) \[[56](https://arxiv.org/html/2605.26147#bib.bib53)\] introduce probability distributions over the network weights, which results in distributional predictions. The bottleneck lies in inferring these weight distributions from data. Foundational approaches such as Bayes by Backprop \[[5](https://arxiv.org/html/2605.26147#bib.bib54)\] utilize variational inference to approximate intractable weight posteriors, while Monte Carlo Dropout \[[20](https://arxiv.org/html/2605.26147#bib.bib55)\] provides a mathematically grounded, computationally cheaper approximation by performing multiple stochastic forward passes. However, these methods primarily focus on capturing epistemic uncertainty within the model parameters rather than explicitly modeling the sequential accumulation of state-belief.
Alternatively, classical Bayesian probability relies on exact posterior inference using conjugate priors. When the posterior distribution belongs to the same probability family as the prior, the likelihood function updates the prior analytically via closed-form algebraic addition \[[65](https://arxiv.org/html/2605.26147#bib.bib56),[21](https://arxiv.org/html/2605.26147#bib.bib57),[48](https://arxiv.org/html/2605.26147#bib.bib58)\]. For example, the Beta distribution acts as the conjugate prior for the Binomial likelihood, and the Dirichlet distribution acts as the conjugate prior for the Categorical likelihood. While conjugate updates are typically restricted to simple statistical models, our Neural Bayesian Sequential Routing (NBSR) framework leverages this exact conjugacy in a deep learning context. By treating continuous neural outputs as pseudo-observations, NBSR performs sequential belief updating without the prohibitive computational overhead of standard variational BNN ensembles.
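As a brief illustration of this closed-form addition, the block below states the standard Dirichlet-Categorical (multinomial count) update in generic textbook notation. The count vector $\mathbf{n}$ is a symbol introduced here for illustration and is not taken from the paper.

```latex
\begin{aligned}
\bm{p} &\sim \text{Dir}(\bm{\alpha}), \qquad \bm{\alpha} = [\alpha_1,\dots,\alpha_K] \\
\mathbf{n} \mid \bm{p} &\sim \text{Multinomial}(N, \bm{p}), \qquad \mathbf{n} = [n_1,\dots,n_K] \\
\bm{p} \mid \mathbf{n} &\sim \text{Dir}(\bm{\alpha} + \mathbf{n})
\end{aligned}
```

For instance, a uniform prior $\bm{\alpha}=[1,1,1]$ that observes counts $\mathbf{n}=[3,0,1]$ yields the posterior $\text{Dir}([4,1,2])$. In NBSR, the integer counts are replaced by non-negative, real-valued neural pseudo-observations.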
##### Evidential Deep Learning
Evidential Deep Learning \(EDL\), pioneered by Sensoy et al\.\[[69](https://arxiv.org/html/2605.26147#bib.bib28)\], provides an elegant paradigm for uncertainty quantification without requiring the computational complexities of Bayesian sampling\. Rooted in Subjective Logic and the Dempster\-Shafer Theory of Evidence, EDL reformulates discrete classification tasks\. Instead of forcing a neural network to distribute a total probability mass of 1 across all classes via a softmax layer \(which may lead to overconfident predictions on unfamiliar data\), EDL trains a deterministic neural network to output the concentration parameters of a Dirichlet distribution\.
This approach parameterizes a “distribution over distributions” over the categorical simplex. When the network encounters familiar data, it generates high positive evidence for the target class, sharpening the Dirichlet distribution. Conversely, when it encounters out-of-distribution (OOD) data, the network generates zero evidence, gracefully reverting the Dirichlet state to a uniform, maximum-entropy prior (i.e. explicitly outputting an “I do not know” state). EDL has demonstrated state-of-the-art success in OOD detection and adversarial robustness. Our proposed framework fundamentally extends the EDL paradigm; whereas standard EDL relies on a static, monolithic forward pass, NBSR distributes the evidence extraction process across a hierarchical, dynamically routed DAG, transforming evidential inference into an active, resource-rational sequential process.
##### Hierarchical Mixtures of Experts and Decision Trees
Classical divide-and-conquer algorithms, such as Classification and Regression Trees (CART) \[[7](https://arxiv.org/html/2605.26147#bib.bib77)\], explicitly partition the input space into nested regions using hard routing decisions. To provide a differentiable, probabilistic alternative, the Hierarchical Mixture of Experts (HME) \[[42](https://arxiv.org/html/2605.26147#bib.bib78)\] was introduced. HME structures neural networks as a soft decision tree where internal nodes act exclusively as gating functions and terminal leaves act as local regression or classification experts. The final prediction is computed as a probability-weighted continuous average of the leaf experts’ outputs. Subsequent works extended HMEs with constructive growing and pruning algorithms to optimize the tree topology during training \[[74](https://arxiv.org/html/2605.26147#bib.bib79)\]. While these models elegantly introduce conditional computation, they suffer from critical architectural limitations: they treat intermediate nodes purely as routers (delaying all decisions to the leaves), they partition the raw input space directly, and their reliance on multiplicative, zero-sum probabilities prevents the native quantification of epistemic uncertainty. Our NBSR framework fundamentally departs from this paradigm. By treating every node (whether internal or terminal) as an active evidence extractor, enforcing hard discrete routing to prevent representation collapse, and sequentially querying a shared, dense Global Knowledge Oracle to additively accumulate Dirichlet evidence, NBSR transforms the tree from a simple soft-routing mechanism into an active, uncertainty-aware Bayesian reasoning process (footnote 3).
Footnote 3: A comprehensive architectural comparison can be found in Appendix [E](https://arxiv.org/html/2605.26147#A5).
## 3 Preliminaries
##### Dirichlet Distribution and Evidential Belief\.
Standard deterministic neural networks produce point-estimate probability vectors (e.g. via a softmax layer) that lack a native notion of uncertainty, frequently leading to overconfident predictions on unfamiliar data. In contrast, evidential deep learning \[[69](https://arxiv.org/html/2605.26147#bib.bib28)\] formulates discrete predictive tasks, a $K$-class image classification for example, as a Bayesian process by predicting the concentration parameters $\bm{\alpha}=[\alpha_1,\dots,\alpha_K]$ of a Dirichlet distribution (footnote 4). As the multivariate generalization of the Beta distribution, the Dirichlet distribution acts as a continuous probability density over the categorical probability simplex, essentially serving as a “distribution over distributions”.
Footnote 4: Extended mathematical details regarding the Dirichlet distribution are provided in Appendix [A](https://arxiv.org/html/2605.26147#A1).
Drawing from Subjective Logic, this formalizes the notion of belief assignments into a framework where predictions are treated as subjective opinions. The positive parameters $\alpha_k>0$ represent the accumulated evidence for each respective class. Specifically, the parameters are linked to the learned neural evidence $e_k\geq 0$ via the relation $\alpha_k=e_k+1$. When a network extracts no supporting evidence ($e_k=0$ for all $k$), the parameters default to $\alpha_k=1$, yielding a uniform prior distribution $\bm{\alpha}=[1,\dots,1]$. This state explicitly represents total epistemic uncertainty, allowing the model to mathematically express “I do not know” when faced with ambiguous or out-of-distribution inputs.
Under this joint formulation, to focus on the probability $p_k$ of a specific class $k$, we integrate out the probabilities of the other $K-1$ classes. The resulting marginal distribution for class $k$ mathematically collapses into a Beta distribution:
$$p_k\sim\text{Beta}(\alpha_k,\alpha_0-\alpha_k)$$
where $\alpha_0=\sum_{i=1}^{K}\alpha_i$ represents the total precision. Rooted in the framework of Subjective Logic \[[43](https://arxiv.org/html/2605.26147#bib.bib73),[69](https://arxiv.org/html/2605.26147#bib.bib28)\], epistemic uncertainty $u$ is explicitly defined as inversely proportional to this total evidence:
$$u=\frac{K}{\alpha_0}$$
(cc. Eq. [24](https://arxiv.org/html/2605.26147#A1.E24))
Consequently, $\alpha_0$ acts as a direct inverse measure of structural uncertainty. To obtain a concrete, scalar probability score for evaluating the network (such as for computing the Negative Log-Likelihood), we calculate the mathematical mean of this marginal distribution. The resulting Bayesian expected marginal is:
$$\mathbb{E}[p_k]=\frac{\alpha_k}{\alpha_0}$$
(cc. Eq. [17](https://arxiv.org/html/2605.26147#A1.E17))
By utilizing these expected marginals rather than deterministic softmax outputs, our framework explicitly binds predictive confidence to total accumulated evidence.
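The relations above ($\alpha_k=e_k+1$, $\alpha_0=\sum_k\alpha_k$, $\mathbb{E}[p_k]=\alpha_k/\alpha_0$, $u=K/\alpha_0$) can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation; the function name `dirichlet_summary` and the example evidence vectors are hypothetical.

```python
def dirichlet_summary(evidence):
    """Map non-negative evidence e_k to the Dirichlet quantities defined in the text.

    Hypothetical helper for illustration only.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]   # alpha_k = e_k + 1
    alpha0 = sum(alpha)                   # total precision alpha_0
    mean = [a / alpha0 for a in alpha]    # E[p_k] = alpha_k / alpha_0
    uncertainty = k / alpha0              # u = K / alpha_0
    return alpha, alpha0, mean, uncertainty


# No evidence: uniform prior, so the uncertainty u equals 1.
no_evidence = dirichlet_summary([0.0, 0.0, 0.0])

# Strong evidence for class 0: alpha = [28, 1, 1], alpha_0 = 30, u = 3 / 30.
strong_evidence = dirichlet_summary([27.0, 0.0, 0.0])
```

With no evidence, every class receives an expected probability of $1/K$ and $u=1$; as evidence for one class grows, both the expected marginal of that class and the total precision increase, and $u$ shrinks toward zero.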
##### Differential Entropy of the Dirichlet distribution\.
To quantify the total structural uncertainty of the belief state, we rely on the differential entropy of the Dirichlet distribution, denoted as $\mathcal{H}(\bm{\alpha})$. In our framework, this entropy serves as a mathematically rigorous metric for epistemic confidence and is computed analytically as:
$$\mathcal{H}(\bm{\alpha})=\log\text{B}(\bm{\alpha})+(\alpha_0-K)\psi(\alpha_0)-\sum_{k=1}^{K}(\alpha_k-1)\psi(\alpha_k)$$
(cc. Eq. [25](https://arxiv.org/html/2605.26147#A1.E25))
where $\text{B}(\bm{\alpha})$ is the multivariate Beta function and $\psi(\cdot)$ denotes the digamma function (the logarithmic derivative of the Gamma function). As evidence accumulates ($\alpha_0\to\infty$), the variance of the distribution strictly shrinks, driving the differential entropy $\mathcal{H}(\bm{\alpha})$ downward towards $-\infty$ in the limit of absolute certainty (footnote 5). We exploit this informational dynamic by using $\mathcal{H}(\bm{\alpha})$ as a temporal penalty during training, and use a threshold ($\eta$) for dynamic early exiting during inference.
Footnote 5: As formalized later in Theorem [1](https://arxiv.org/html/2605.26147#Thmtheorem1), while the total precision strictly increases and the variance strictly shrinks at every sequential step, the differential entropy $\mathcal{H}(\bm{\alpha})$ is not mathematically guaranteed to decrease strictly monotonically. Specifically, an asymmetrical evidence update that strongly contradicts the current prior can cause a momentary entropy spike as the distribution shifts its center of mass, though it ultimately continues its asymptotic collapse towards $-\infty$.
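The closed-form entropy above can be evaluated with the Python standard library alone. The sketch below is illustrative rather than the paper's code: `digamma` is a hand-written approximation (upward recurrence followed by the standard asymptotic series), used here only because the standard library does not provide one, and both function names are assumptions.

```python
import math


def digamma(x):
    """Approximate psi(x) for x > 0 via recurrence plus an asymptotic series."""
    if x <= 0:
        raise ValueError("this sketch only supports x > 0")
    result = 0.0
    # Shift x upward using psi(x) = psi(x + 1) - 1/x until the series is accurate.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 * inv - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result


def dirichlet_entropy(alpha):
    """Differential entropy H(alpha) of a Dirichlet distribution."""
    k = len(alpha)
    alpha0 = sum(alpha)
    # log of the multivariate Beta function: sum(log Gamma(alpha_k)) - log Gamma(alpha_0)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(alpha0)
    return (log_beta
            + (alpha0 - k) * digamma(alpha0)
            - sum((a - 1.0) * digamma(a) for a in alpha))


uniform_entropy = dirichlet_entropy([1.0, 1.0, 1.0])    # equals -log(2) for K = 3
sharper_entropy = dirichlet_entropy([10.0, 10.0, 10.0])  # lower than the uniform case
```

An early-exit rule of the kind described in the text would compare such a value against the threshold $\eta$; the comparison itself is omitted here because the paper's exact exit criterion is defined in a later section.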
##### Differentiable Discrete Routing\.
A core challenge in hierarchical neural networks is routing data through discrete, conditional paths. Let $\mathbf{z}=(z_1,\dots,z_K)$ denote the routing logits at a given decision node. To encourage structural exploration during training, the network must sample a routing path stochastically. However, standard categorical sampling and the discrete argmax operation are fundamentally non-differentiable, severing the backpropagation of error gradients. To resolve this, we leverage the Gumbel-Softmax continuous relaxation \[[40](https://arxiv.org/html/2605.26147#bib.bib9),[53](https://arxiv.org/html/2605.26147#bib.bib10)\], which serves as a re-parameterization trick for discrete distributions. By injecting independent Standard Gumbel noise $g_k\sim\text{Gumbel}(0,1)$ into the routing logits, we cleanly decouple the stochasticity from the deterministic network parameters. This enables us to compute a smooth, differentiable approximation vector $\bm{\pi}$ using a temperature-scaled softmax:
$$\pi_k=\frac{\exp((z_k+g_k)/\tau)}{\sum_{i=1}^{K}\exp((z_i+g_i)/\tau)}$$
(cc. Eq. [32](https://arxiv.org/html/2605.26147#A3.E32))
where the temperature hyperparameter $\tau>0$ controls the smoothness of the distribution. As $\tau\to 0$, the continuous output $\bm{\pi}$ asymptotically approaches a discrete one-hot vector.
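The relaxation above can be sketched with the standard library as follows. This is a minimal, framework-free illustration under stated assumptions: the names `sample_gumbel` and `gumbel_softmax`, the optional `noise` argument, and the example logits are hypothetical, and Gumbel noise is drawn with the standard inverse-CDF transform $g=-\log(-\log U)$, $U\sim\text{Uniform}(0,1)$.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Standard Gumbel(0, 1) sample via the inverse-CDF transform."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau, noise=None, rng=random):
    """Temperature-scaled softmax over Gumbel-perturbed logits."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    if noise is None:
        noise = [sample_gumbel(rng) for _ in logits]
    scaled = [(z + g) / tau for z, g in zip(logits, noise)]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]

# High temperature with sampled noise: a smooth, stochastic routing vector.
soft = gumbel_softmax(logits, tau=5.0, rng=rng)

# Low temperature with the noise fixed to zero: nearly one-hot on the largest logit.
sharp = gumbel_softmax(logits, tau=0.1, noise=[0.0, 0.0, 0.0])
```

Fixing the noise to zero in the second call is only a device for making the temperature effect reproducible; in training, the noise is resampled on every forward pass.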
To enforce strict conditional execution (saving FLOPs on unvisited branches) while preserving end-to-end trainability, we couple this relaxation with the Straight-Through Estimator (STE) \[[3](https://arxiv.org/html/2605.26147#bib.bib11)\]. During the forward pass, we discretize $\bm{\pi}$ into a hard one-hot vector $\mathbf{a}_{\text{hard}}$ via the argmax operator. During the backward pass, the non-differentiable discretization is bypassed, and gradients flow directly through the continuous relaxation $\bm{\pi}$. Mathematically, this surrogate gradient operation is implemented natively in modern autograd engines as (footnote 6):
$$\mathbf{a}_{\text{out}}=(\mathbf{a}_{\text{hard}}-\bm{\pi})\text{.detach()}+\bm{\pi}$$
(cc. Eq. [33](https://arxiv.org/html/2605.26147#A3.E33))
This mechanism guarantees that the forward computation remains strictly discrete and path-dependent, while the backward pass provides the dense, continuous gradients necessary to update the upstream gating parameters (for extended theoretical details, see Appendix [C](https://arxiv.org/html/2605.26147#A3)).
Footnote 6: The .detach() operator severs a tensor from the computational graph, preventing gradients from propagating through it. During the forward pass, the detached term functionally cancels out $\bm{\pi}$, simplifying the equation to exactly $\mathbf{a}_{\text{hard}}$. During the backward pass, the gradient of the detached term evaluates to zero, ensuring the learning signal bypasses the non-differentiable argmax step and flows exclusively through the continuous approximation $+\bm{\pi}$.
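To make the two passes explicit, the block below rewrites the detach operation as a stop-gradient operator $\text{sg}(\cdot)$ and applies the standard Jacobian of a temperature-scaled softmax. The $\text{sg}$ notation and the Kronecker delta $\delta_{kj}$ are introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
\mathbf{a}_{\text{out}} &= \text{sg}\!\left(\mathbf{a}_{\text{hard}} - \bm{\pi}\right) + \bm{\pi} \\
\text{Forward:}\quad \mathbf{a}_{\text{out}} &= \left(\mathbf{a}_{\text{hard}} - \bm{\pi}\right) + \bm{\pi} = \mathbf{a}_{\text{hard}} \\
\text{Backward:}\quad \frac{\partial a_{\text{out},k}}{\partial z_j} &= 0 + \frac{\partial \pi_k}{\partial z_j} = \frac{1}{\tau}\,\pi_k\left(\delta_{kj} - \pi_j\right)
\end{aligned}
```

The forward value is therefore exactly one-hot, while the gradient with respect to the routing logits is that of the relaxed sample $\bm{\pi}$.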
## 4 Methodology
Our framework decomposes the decision-making process into three sequential phases: Global Feature Extraction, Graph-Based Routing, and Bayesian Evidence Accumulation. Below, we detail the exact network architecture and logic, using a $K$-dimensional outcome space as our primary illustrative model (e.g. $K=10$ for CIFAR-10 categorization).
##### Notations
To ensure clarity throughout the methodological and theoretical formulations, we summarize the key mathematical notations used in this paper in Table [1](https://arxiv.org/html/2605.26147#S4.T1).
Table 1: Summary of Key Mathematical Notations
### 4\.1Pipeline Overview
The neural Bayesian sequential routing pipeline transforms raw input data into a series of discrete, uncertainty-aware decisions that progressively refine a global belief state. This process mirrors human cognitive strategies, where an initial broad hypothesis is “sharpened” through the sequential acquisition of specific evidence.
The decision-making architecture consists of a Directed Acyclic Graph (DAG) $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where nodes $\mathcal{V}$ represent specific decision states and edges $\mathcal{E}$ represent the possible actions or paths. Within this graph, we utilize two distinct neural components: a Router Network to navigate the graph and an Expert Network to extract information.
The data flow follows three logical stages:
1. Global Encoding (The Global Knowledge Oracle): the input $x$ (e.g. an image or patient profile) is first processed by a shared neural backbone to produce a static feature representation $\mathbf{h}_x \in \mathbb{R}^d$. This vector acts as a Persistent Knowledge Oracle, a constant reference point for the entire system. Unlike standard feed-forward networks where data is transformed and “forgotten” layer-by-layer, $\mathbf{h}_x$ is broadcast to all nodes in the DAG. This ensures that every subsequent decision has access to the full contextual state of the original input. (Footnote 7: In some sense, this serves as a “residual connection”.)
2. Sequential Traversal (The Routing Mechanism): starting at the root node $v_0$, the model traverses a path $\mathcal{T}$ through the graph. At each node $v_t$, a Router Network $\text{MLP}_{\theta_{v_t}}$ is activated to decide which path to take next. It takes as input the concatenation of the global features $\mathbf{h}_x$ and the current Bayesian Belief State $\bm{\alpha}_t$, as well as any optional side information $\bm{\varepsilon}_t$. Conditioning on $\bm{\alpha}_t$ ensures the path taken depends not just on the raw data, but on what the model has already “learned” or concluded in previous steps. To make a selection, the router outputs a discrete action $a_t$ using a Gumbel-Softmax relaxation. This technique allows the model to choose a single, “hard” path during inference (improving efficiency) while still allowing gradients to flow during training so the model can learn from its mistakes. (Footnote 8: This side information can be, e.g., rules learned at a high level of abstraction of the dataset (such as physical constraints), or information at different levels of fidelity. For simplicity, we ignore this extra information in what follows.)
3. Evidence Accumulation (The Bayesian Update): upon reaching the selected node $v_{t+1}$, a local Expert Network $f_{v_{t+1}}$ (parameterized by weights $\mathbf{W}$ and bias $\mathbf{b}$) interrogates (queries) the Oracle $\mathbf{h}_x$. The expert extracts a strictly positive Evidence Vector $\mathbf{e}_t \in \mathbb{R}^K_+$. This evidence is added to the current belief state via Conjugate Addition: $\bm{\alpha}_{t+1} = \bm{\alpha}_t + \mathbf{e}_t$. This update rule is rooted in Bayesian statistics, specifically the Dirichlet distribution. Every addition of evidence “sharpens” the model’s confidence, narrowing the probability distribution over the $K$ possible outcomes. This process repeats, adding more evidence and reducing entropy, until a Confidence Threshold $\eta$ is met or a leaf node is reached, at which point a final prediction is rendered.
A depth\-2 computational graph illustrating the NBSR procedure \(both the discrete forward execution path and the continuous backward gradient flow\) is shown in Fig\.[1](https://arxiv.org/html/2605.26147#S4.F1)\.
(Figure 1 diagram not reproduced; recoverable elements follow.) The Global Knowledge Oracle $\mathbf{h}_x$ (e.g. a CNN backbone) feeds every router and expert. The root node $v_0$ holds the prior $\bm{\alpha}_0 = \mathbf{1}$; its router $\pi([\mathbf{h}_x \oplus \bm{\alpha}_0 \oplus \bm{\varepsilon}_0]; \theta_{v_0})$ selects action $a_0$ (to node $v_1$) over $a'_0$ (to the unvisited node $v'_1$). Node $v_1$ computes evidence $\mathbf{e}_0 = f(\mathbf{h}_x; \mathbf{W}_{v_1}, \mathbf{b}_{v_1})$ and updates $\bm{\alpha}_1 = \bm{\alpha}_0 + \mathbf{e}_0$. If $\mathcal{H}(\bm{\alpha}_1) < \eta$, the model exits early and predicts using $\bm{\alpha}_1$; otherwise the router $\pi([\mathbf{h}_x \oplus \bm{\alpha}_1 \oplus \bm{\varepsilon}_1]; \theta_{v_1})$ selects action $a_1$ (to node $v_2$) over $a'_1$ (to the unvisited node $v'_2$). Node $v_2$ computes $\mathbf{e}_1 = f(\mathbf{h}_x; \mathbf{W}_{v_2}, \mathbf{b}_{v_2})$, updates $\bm{\alpha}_2 = \bm{\alpha}_1 + \mathbf{e}_1$, and the final decision is predicted using $\bm{\alpha}_2$. Total loss: $\mathcal{L} = \text{NLL}(\bm{\alpha}_2, y^{\ast}) + \lambda \sum \mathcal{H}(\bm{\alpha}_t)$. Backward gradient flow: $\nabla \mathbf{W}_{v_2}, \nabla \mathbf{b}_{v_2}$; $\nabla \theta_{v_1}$ (via GS); $\nabla \mathbf{W}_{v_1}, \nabla \mathbf{b}_{v_1}$; $\nabla \theta_{v_0}$ (via GS); $\nabla \phi$ (optional).
Figure 1: Computational graph of the NBSR process. The total number of sequential layers (maximum tree depth) and the number of expert nodes at each layer (branching width) are predefined hyperparameters that establish the maximal capacity of the super-graph. The Global Knowledge Oracle is positioned in the outer loop, while the Total Loss resides in the inner loop. The oracle provides a persistent feature set $\mathbf{h}_x$ that is queried by both routers and experts. Forward Pass: path selection is conditioned on the current belief and oracle data. Backward Pass: Gumbel-Softmax (GS) allows end-to-end gradient updates.
### 4\.2The Decision Graph
We formulate the sequential decision-making process as a traversal over a Directed Acyclic Graph (DAG), denoted as $\mathcal{G}=(\mathcal{V},\mathcal{E})$. Here, $\mathcal{V}$ represents the set of computational nodes (vertices, acting as local experts and routing gates), and $\mathcal{E}$ represents the directed edges corresponding to routing actions. The process originates at a designated root node $v_0$, which acts purely as an initial routing gate without an evidence-extracting expert, and then sequentially traverses the graph. At each node $v_t \in \mathcal{V}$, a conditional routing policy selects the discrete outgoing edge $a_t \in \mathcal{A}(v_t)$ to traverse, where $\mathcal{A}(v_t)$ is the action set at node $v_t$; this edge leads to the subsequent node $v_{t+1}$. This traversal governs the flow of information and concludes when a final prediction is rendered over the $K$-dimensional outcome space.
Footnote 9: While the primary formulation describes a strict feed-forward DAG, this topology can be temporally flattened into an Autoregressive State Machine by sharing a single router and expert pool across time. As we will demonstrate in Section [6.6](https://arxiv.org/html/2605.26147#S6.SS6), this cyclical extension prevents combinatorial explosion in long-horizon sequential planning tasks.
While the maximal topology of the graph𝒢\\mathcal\{G\}\(i\.e\. the total number of available layers and expert nodes\) is pre\-specified as a fixed super\-graph, the effective architecture is highly dynamic\. The network autonomously performs implicit pruning during training; if an expert node fails to provide discriminative evidence, the upstream routers learn to assign near\-zero probability to its incoming edges, effectively removing it from the computational flow\. Further, the inference depth is dynamically truncated on a per\-sample basis via the entropy\-based early exiting mechanism\.
By initializing a sufficiently large super-graph, we establish a high-capacity, low-bias foundation. Importantly, the aforementioned dynamic mechanisms act as adaptive regularizers: implicit routing pruning organically restricts the effective width, while entropy-based early exiting curtails the effective depth. Together, they dynamically rein in the variance, allowing the network to strike an optimal, sample-specific balance between representational capacity and generalization. Precise discussions regarding the bias-variance trade-off induced by the graph topology are presented in the later theoretical Section [5.4](https://arxiv.org/html/2605.26147#S5.SS4).
### 4\.3Global Feature Representation: The Knowledge Oracle
To provide a stable informational foundation for the tree, the input $x$ (e.g. visual or tabular data) is embedded once into a dense feature space to create a Global Knowledge Oracle, denoted as $\mathbf{h}_x$. Formally, given a shared neural backbone $f_\phi$ parameterized by $\phi$, the oracle is computed as:
$$\mathbf{h}_x = f_\phi(x) \tag{1}$$
This shared representation acts as a persistent knowledge base that encapsulates the totality of the external data available for a given instance. For example, for visual tasks, we employ a truncated ResNet-18 architecture as $f_\phi$, with the final fully connected classification layer removed, to extract a dense, continuous feature representation $\mathbf{h}_x \in \mathbb{R}^d$, where e.g. $d=256$. Rather than transforming the data sequentially layer-by-layer, this global oracle $\mathbf{h}_x$ is broadcast to the entire graph, allowing all active nodes in the routing tree to query it directly to forge node-specific, local knowledge. Note that, unlike pixel-level models such as random forests or CART decision trees \[[7](https://arxiv.org/html/2605.26147#bib.bib77)\], which look at one raw feature at each hierarchical splitting node, the experts in NBSR partition the semantic space (e.g. “Setosa vs. Non-Setosa” in Iris classification), not the raw input space.
Footnote 10: A comparison between NBSR and decision trees is made in Appendix [E](https://arxiv.org/html/2605.26147#A5).
### 4\.4The Bayesian State and Initialization
The ultimate objective is to map the input to one of $K$ final outcomes (e.g. object categories, clinical endpoints). We model the uncertainty over these outcomes using a Dirichlet distribution, parameterized by a concentration vector $\bm{\alpha}_t \in \mathbb{R}^K_+$; this Dirichlet distribution is sequentially refined along the selected tree trajectory. If we index the node depth by $t$, at the root node ($t=0$), the belief state is initialized to a uniform, maximum-entropy prior:
$$\bm{\alpha}_0 = \mathbf{1} \in \mathbb{R}^K \tag{2}$$
which implies no preference over the $K$ decisions at the beginning.
Footnote 11: The Dirichlet distribution is a multivariate continuous probability distribution parameterized by a positive concentration vector $\bm{\alpha}$. As the conjugate prior to the Categorical distribution, it essentially represents a “distribution over distributions,” making it mathematically ideal for modeling structural uncertainty over a $K$-dimensional probability simplex. Details about the Dirichlet distribution can be found in Appendix [A](https://arxiv.org/html/2605.26147#A1).
### 4\.5The Differentiable Router Network
At step $t$, located at node $v_t$, the model must select a single discrete branch $a_t \in \mathcal{A}(v_t)$ to traverse deeper into the tree. The routing decision is conditioned on both the input features and the current shape of the model’s uncertainty.
The router is instantiated as a Multi-Layer Perceptron (MLP). (Footnote 12: The router architecture can be chosen as any reasonable network; here we choose an MLP as an example.) It accepts as input a concatenation of the global feature vector $\mathbf{h}_x$ with the current belief state $\bm{\alpha}_t$, alongside any optional side information $\bm{\varepsilon}_t$:
$$\mathbf{z}_t = \text{MLP}_{\theta_{v_t}}([\mathbf{h}_x \oplus \bm{\alpha}_t \oplus \bm{\varepsilon}_t]) \tag{3}$$
From a practical implementation standpoint, the router network at any given node $v_t$ must have a statically defined output dimension. In standard deep learning frameworks such as PyTorch \[[61](https://arxiv.org/html/2605.26147#bib.bib16)\] or TensorFlow \[[1](https://arxiv.org/html/2605.26147#bib.bib17)\], the final linear layer of this MLP is hard-coded at initialization to output exactly $|\mathcal{A}(v_t)|$ routing logits $\mathbf{z}_t$, corresponding directly to the total number of possible discrete actions (child nodes) branching from $v_t$.
###### Example 1\(MLP Router Architecture for CIFAR\-10 Classification\)\.
For a typical visual categorization task with a $K$-dimensional outcome space and a convolutional backbone extracting $d$-dimensional features, the router is parameterized as follows:
- Input Dimension: $d + K + |\bm{\varepsilon}_t|$ (e.g. $256 + 10 = 266$ for standard CIFAR-10 without side information).
- Hidden Layer: linear projection to 128 dimensions, followed by a ReLU activation.
- Output Layer: linear projection to $|\mathcal{A}(v_t)|$ dimensions (the number of outgoing branches from node $v_t$).
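The layer sizes of Example 1 can be checked with a small pure-Python forward pass. This is a sketch under assumptions: the branching factor of 3, the uniform weight initialisation, and all helper names are illustrative and not taken from the paper.

```python
import random

def linear(x, weight, bias):
    """y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(n_in, n_out, rng):
    """Small random weights; the initialisation scheme is an arbitrary placeholder."""
    weight = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def router_logits(h_x, alpha_t, params, side_info=()):
    """Eq. (3): z_t = MLP([h_x ⊕ alpha_t ⊕ eps_t]) with one ReLU hidden layer."""
    (w1, b1), (w2, b2) = params
    x = list(h_x) + list(alpha_t) + list(side_info)      # concatenation
    hidden = [max(0.0, v) for v in linear(x, w1, b1)]    # ReLU
    return linear(hidden, w2, b2)

rng = random.Random(0)
d, K, hidden_dim, n_branches = 256, 10, 128, 3   # n_branches = |A(v_t)| is assumed
params = (init_linear(d + K, hidden_dim, rng), init_linear(hidden_dim, n_branches, rng))
h_x = [rng.gauss(0.0, 1.0) for _ in range(d)]    # stand-in for the oracle features
alpha_0 = [1.0] * K                              # uniform prior of Eq. (2)
z_0 = router_logits(h_x, alpha_0, params)
n_params = (d + K) * hidden_dim + hidden_dim + hidden_dim * n_branches + n_branches
```

Under these assumed sizes the router holds 34,563 parameters, almost all of them in the first layer that reads the 266-dimensional concatenated input.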
To achieve hard routing while preserving end-to-end differentiability, we apply the Gumbel-Softmax trick \[[40](https://arxiv.org/html/2605.26147#bib.bib9),[53](https://arxiv.org/html/2605.26147#bib.bib10)\] to the raw routing logits $\mathbf{z}_t$. Specifically, we first draw i.i.d. noise samples $g_k$ from a standard Gumbel distribution to perturb the logits, which enables stochastic exploration. We then compute a continuous routing probability vector $\bm{\pi}$, where the probability of selecting branch $k$ is given by ([cc. Eq. 32](https://arxiv.org/html/2605.26147#S3.Ex5)):
$$\pi_k = \frac{\exp((z_{t,k} + g_k)/\tau)}{\sum_{i=1}^{|\mathcal{A}(v_t)|} \exp((z_{t,i} + g_i)/\tau)} \tag{4}$$
Here, $\tau > 0$ is a temperature parameter that controls the smoothness of the distribution; as $\tau \to 0$, the continuous softmax output asymptotically approaches a discrete one-hot vector.
To enforce strict conditional execution, ensuring the model only pays the computational cost for a single path, we utilize the Straight-Through Estimator (STE). During the forward training pass, the continuous vector $\bm{\pi}$ is discretized using a hard argmax, effectively activating only a single subsequent node $v_{t+1}$. During the backward pass, the non-differentiable argmax operation is bypassed, and gradients flow smoothly through the continuous approximation $\bm{\pi}$ directly into the router’s parameters (further mathematical details are provided in Appendix [C](https://arxiv.org/html/2605.26147#A3)).
Footnote 13: The Straight-Through Estimator (STE) is a technique used to train neural networks with discrete, non-differentiable operations. It works by performing the discrete operation (such as argmax or thresholding) during the forward pass, but treating it as a differentiable identity function, or routing gradients through its smooth continuous proxy, during the backward pass. This prevents the gradients from zeroing out, enabling standard backpropagation through discrete bottlenecks.
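The sampling step of Eq. (4) and the effect of the temperature can be sketched in a few lines. The example below is illustrative only: it uses the Python standard library, toy logits, and hypothetical function names, and it also checks empirically the well-known Gumbel-max property that the argmax of the perturbed logits is distributed as softmax of the raw logits.

```python
import math
import random

def sample_gumbel(rng):
    """Standard Gumbel(0, 1) sample via inverse transform: g = -log(-log(u))."""
    u = rng.random()
    while u == 0.0:              # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_softmax(logits, tau, rng):
    """Eq. (4): softmax of (z + g) / tau with i.i.d. Gumbel noise g."""
    perturbed = [(z + sample_gumbel(rng)) / tau for z in logits]
    m = max(perturbed)
    exps = [math.exp(p - m) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 0.0, -1.0]        # toy routing logits (assumed values)
# Same seed gives the same noise, so only the temperature differs.
pi_hot = gumbel_softmax(logits, tau=5.0, rng=random.Random(1))
pi_cold = gumbel_softmax(logits, tau=0.1, rng=random.Random(1))

# Empirical frequency with which branch 0 wins the perturbed argmax.
rng = random.Random(0)
n = 20000
wins = sum(1 for _ in range(n)
           if max(range(3), key=lambda i: logits[i] + sample_gumbel(rng)) == 0)
freq_first = wins / n
```

With identical noise, the low-temperature vector is visibly closer to one-hot than the high-temperature one, and `freq_first` lands near the softmax probability of the first logit.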
### 4\.6Evidence Extraction and Belief Update
Once a target node $v_{t+1}$ is activated, its resident “expert” network extracts local evidence from the input to update the belief state. Conceptually, this process treats the global feature representation as a persistent knowledge base, driven by three interacting components:
- The Global (Knowledge) Oracle ($\mathbf{h}_x$): serves as the static, comprehensive reference of the input instance, as defined previously in Section [4.3](https://arxiv.org/html/2605.26147#S4.SS3).
- The Context-Aware Query ($\mathbf{z}_t$): at step $t$, the router asks: “based on what I already know (i.e. the current state $\bm{\alpha}_t$) and what the oracle holds ($\mathbf{h}_x$), which expert should I consult next?”
- The Information Extractor ($f_{v_{t+1}}$): the chosen expert network $f_{v_{t+1}}$ acts as a specific, selective querying function $\mathbf{e}_t = f(\mathbf{h}_x; \mathbf{W}_{v_{t+1}}, \mathbf{b}_{v_{t+1}})$. It interrogates the oracle $\mathbf{h}_x$ from a highly specialized perspective (e.g. “look specifically for bird-like features”) and extracts only the relevant localized evidence $\mathbf{e}_t$. (Footnote 14, Static vs. Dynamic Experts: our framework currently employs Static Experts with implicit context. Because the router already used the belief state $\bm{\alpha}_t$ to select node $v_{t+1}$, the simple act of arriving at this node is implicitly conditioned on $\bm{\alpha}_t$. The matrix $\mathbf{W}_{v_{t+1}}$ represents a fixed, specialized filter; it does not need explicit knowledge of $\bm{\alpha}_t$ to extract its specific feature. This static design ensures faster inference, easier training, and conceptual clarity. An alternative is a Dynamic Expert framework formulated as $f(\mathbf{h}_x, \bm{\alpha}_t; \mathbf{W}, \mathbf{b})$. In this paradigm, $\bm{\alpha}_t$ acts as an explicit query in an attention mechanism, while $\mathbf{h}_x$ acts as the key/value, allowing the expert to dynamically shift its weights based on the model’s exact current uncertainty. We leave the exploration of dynamic attention-based experts to future work.)
Mathematically, the evidence extractor typically consists of a linear mapping from the shared feature vector $\mathbf{h}_x$ to the $K$-dimensional outcome space. Here, the trainable parameters $\mathbf{W}_{v_{t+1}}$ and $\mathbf{b}_{v_{t+1}}$ dictate precisely what information to extract from the global oracle given the specific context of the current node. To satisfy the strict positivity requirement of the Dirichlet concentration parameters, the output must be passed through a positivity-enforcing activation function:
$$\mathbf{e}_t = \text{Activation}(\mathbf{W}_{v_{t+1}} \mathbf{h}_x + \mathbf{b}_{v_{t+1}}) \tag{5}$$
The choice of this activation function dictates the epistemic capacity of the expert. For terminal leaf experts, this is typically the Softplus function ($\ln(1+e^x)$), which is unbounded above and therefore allows the network to inject massive evidence to crystallize a final, highly confident decision. However, to mathematically enforce a hierarchical taxonomy, intermediate routing nodes can utilize a scaled Sigmoid activation. This places a strict upper bound on the evidence an intermediate node can extract, acting as an epistemic confidence budget that ensures broad, early-stage hypotheses remain appropriately hesitant (a dynamic visually demonstrated in our toy experiment in Section [6.1](https://arxiv.org/html/2605.26147#S6.SS1)).
Footnote 15: The Softplus function is a smooth, differentiable activation function defined as $\text{Softplus}(x) = \ln(1+e^x)$, acting as a continuous approximation of the ReLU activation function. It produces positive outputs ranging from 0 to $\infty$, its derivative is the sigmoid function, and it is commonly used to constrain neural network outputs to be positive, for example in regression tasks.
The resulting $\mathbf{e}_t$ thus serves as an information gain to the current nodal decision-maker. The belief state is then deterministically updated via conjugate addition, accumulating the evidence gathered by traversing the chosen path:
$$\bm{\alpha}_{t+1} = \bm{\alpha}_t + \mathbf{e}_t \tag{6}$$
Footnote 16: In Bayesian statistics, the Dirichlet distribution is the conjugate prior for the Categorical distribution. Traditionally, observing discrete categorical counts $\mathbf{c}$ updates a prior $\text{Dir}(\bm{\alpha})$ to a posterior $\text{Dir}(\bm{\alpha} + \mathbf{c})$. Our framework generalizes this property by treating the neural network’s continuous, positive evidence vectors $\mathbf{e}_t$ as pseudo-counts. According to Bayes’ theorem, the posterior distribution over the class probabilities $\mathbf{p}$ is given by $P(\mathbf{p} \mid \mathbf{e}_t) \propto P(\mathbf{e}_t \mid \mathbf{p})\, P(\mathbf{p} \mid \bm{\alpha}_t) \propto \prod_{k=1}^{K} p_k^{(\alpha_{t,k} + e_{t,k}) - 1}$. This guarantees that the posterior remains a Dirichlet distribution parameterized by $\bm{\alpha}_t + \mathbf{e}_t$, effectively performing a differentiable, continuous analog of exact Bayesian updating. See Appendix [A](https://arxiv.org/html/2605.26147#A1) for extended mathematical details.
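A toy numerical instance of Eqs. (5) and (6) may help fix ideas. The sketch below assumes a Softplus expert, $K=3$ outcomes, and a 2-dimensional oracle vector; the weights and feature values are made up for illustration, and the helper names are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable ln(1 + e^x); positive for all x that do not underflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def extract_evidence(h_x, weight, bias):
    """Eq. (5) with a Softplus activation: e = softplus(W h_x + b)."""
    return [softplus(sum(w * h for w, h in zip(row, h_x)) + b)
            for row, b in zip(weight, bias)]

def conjugate_update(alpha, evidence):
    """Eq. (6): alpha_{t+1} = alpha_t + e_t."""
    return [a + e for a, e in zip(alpha, evidence)]

# Toy setting (assumed values): K = 3 outcomes, d = 2 oracle features.
h_x = [1.0, -0.5]
W = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
alpha_0 = [1.0, 1.0, 1.0]                      # uniform prior

e_0 = extract_evidence(h_x, W, b)              # pre-activations are [2.0, -0.5, -0.5]
alpha_1 = conjugate_update(alpha_0, e_0)
mean_1 = [a / sum(alpha_1) for a in alpha_1]   # Dirichlet predictive mean
```

Because every evidence entry is positive, the total precision `sum(alpha_1)` exceeds `sum(alpha_0)`, and the predictive mean shifts toward the first outcome, which received the largest pre-activation.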
###### Example 2\(CIFAR\-10 Evidence Extractor\)\.
Following the previous visual categorization setup in Example 1, the specific architecture for the local evidence extractor at each terminal node $v_{t+1}$ is configured as follows:
- Input layer: receives the $d$-dimensional global feature vector $\mathbf{h}_x$ (e.g. $d=256$ from the truncated ResNet-18 backbone).
- Transformation: a single linear fully-connected layer, parameterized by a weight matrix $\mathbf{W}_{v_{t+1}} \in \mathbb{R}^{K \times d}$ and bias $\mathbf{b}_{v_{t+1}} \in \mathbb{R}^{K}$.
- Activation: an unbounded Softplus activation function is applied to the raw logits to guarantee mapping into the strictly positive domain ($\mathbb{R}_+^K$).
- Output dimension: a $K$-dimensional evidence vector $\mathbf{e}_t$ (e.g. $K=10$, representing the isolated evidence for each categorical outcome).
### 4\.7Training Objective and Dynamics
The routing tree is trained end-to-end to maximize the expected likelihood of the correct class. For a single data instance $x$ with ground truth outcome $y^{\ast}$, we optimize the Negative Log-Likelihood (NLL) of the final Dirichlet distribution $\bm{\alpha}_T$ at the terminal leaf node. This is augmented with an intermediate entropy penalty $\mathcal{H}(\cdot)$ to explicitly enforce the progressive sharpening of the decision boundary along the sampled trajectory $\mathcal{T}$:
$$\mathcal{L}(x, y^{\ast}) = -\log\left(\frac{\alpha_{T,y^{\ast}}}{\sum_{k=1}^{K} \alpha_{T,k}}\right) + \lambda \sum_{v_t \in \mathcal{T}} \mathcal{H}(\bm{\alpha}_t) \tag{7}$$
where $\lambda$ is a tunable hyperparameter controlling the routing efficiency and the aggressiveness of the uncertainty reduction. This loss design forces the model to maximize the expected probability of the ground-truth class, while sharpening our belief along the traversed tree path.
During end-to-end training over a dataset $\mathcal{D}$, the model minimizes the overall empirical risk $\mathcal{R}$, defined as the average loss across all instances. The optimal parameter set $\Theta^{\ast}$ is obtained by solving:
$$\Theta^{\ast} = \arg\min_{\Theta} \mathcal{R}(\Theta) = \arg\min_{\Theta} \frac{1}{|\mathcal{D}|} \sum_{(x_i, y^{\ast}_i) \in \mathcal{D}} \mathcal{L}(x_i, y^{\ast}_i; \Theta) \tag{8}$$
where $\Theta$ encompasses all trainable parameters optimized by this objective within the framework, specifically: the shared global feature backbone ($\phi$), which generates the oracle state $\mathbf{h}_x$; the router parameters ($\theta_{v_t}$), which dictate the sampled path $\mathcal{T}$; and the local evidence extractor weights and biases ($\mathbf{W}_{v_t}, \mathbf{b}_{v_t}$), which compute the evidence vectors $\mathbf{e}_t$ that sequentially construct the belief states $\bm{\alpha}_t$.
Footnote 17: As discussed later in Section [4.9](https://arxiv.org/html/2605.26147#S4.SS9), training the backbone is optional; it can be either pre-trained (frozen) or trainable.
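The per-instance loss of Eq. (7) can be evaluated by hand for a short trajectory. The sketch below assumes that $\mathcal{H}$ is the differential entropy of the Dirichlet distribution, computed with its standard closed form; the digamma routine is a simple recurrence-plus-asymptotic-series approximation written for this illustration, and the two-step trajectory is made up.

```python
import math

def digamma(x):
    """psi(x) for x > 0, via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def dirichlet_entropy(alpha):
    """Differential entropy of Dir(alpha); it can be negative."""
    a0 = sum(alpha)
    k = len(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(a0)
    return log_beta + (a0 - k) * digamma(a0) - sum((a - 1.0) * digamma(a) for a in alpha)

def nbsr_loss(alphas, y_true, lam):
    """Eq. (7): NLL of the terminal Dirichlet mean plus lam times the path entropies."""
    terminal = alphas[-1]
    nll = -math.log(terminal[y_true] / sum(terminal))
    return nll + lam * sum(dirichlet_entropy(a) for a in alphas)

# Toy trajectory (assumed values): uniform prior, then strong evidence for class 0.
alphas = [[1.0, 1.0, 1.0], [10.0, 1.0, 1.0]]
loss = nbsr_loss(alphas, y_true=0, lam=0.01)
```

Both entropy terms are negative here, so the regulariser lowers the total below the bare NLL of $-\log(10/12)$; the sharper second belief state contributes the more negative entropy.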
Training the tree structure admits a highly efficient, sample-specific gradient flow. Because of the discrete routing decisions made during the forward pass, gradients only propagate through the active trajectory $\mathcal{T}$. Consequently, unvisited experts receive zero gradients and remain unmodified by that specific instance. Conversely, active experts along the traversed path are directly optimized to extract more discriminative evidence from the global oracle. Simultaneously, the active routers learn to adjust their conditional routing probabilities via the continuous backward pass of the Gumbel-Softmax estimator; if a chosen path yields a high loss, the router penalizes the corresponding action logit, which inherently boosts the probability of selecting an alternative, unvisited branch in the future. This dynamic interplay of selective evidence extraction and adaptive routing is precisely what allows the tree to autonomously specialize, naturally evolving the computational graph into a highly structured, conditional Mixture-of-Experts (MoE).
### 4\.8Inference, Early Exiting, and Epistemic Abstention
During inference, the stochasticity introduced by the Gumbel noise during training is removed. At each node $v_t$, the router selects the deterministic path by taking the strict argmax of the routing logits $\mathbf{z}_t$.
A core advantage of our Bayesian formulation is the natural affordance for dynamic early exiting. Because the concentration vector $\bm{\alpha}_t$ explicitly quantifies the model’s uncertainty, we can halt the routing process dynamically. If, at step $t$, the differential entropy of the belief state $\mathcal{H}(\bm{\alpha}_t)$ falls below a predefined confidence threshold $\eta$, the traversal can terminate immediately (like early stopping). The expected prediction is then derived from the current Dirichlet distribution: $\mathbb{E}[p_k] = \frac{\alpha_{t,k}}{\sum_i \alpha_{t,i}}$. This mechanism ensures that unambiguous inputs require shallower graph traversals, reducing inference FLOPs while preserving performance.
Further, this formulation intrinsically safeguards against hallucination. By explicitly tracking the total precision $\alpha_0 = \sum_{k=1}^{K} \alpha_{T,k}$ (a scalar, not to be confused with the prior vector $\bm{\alpha}_0$), the framework provides a native Out-Of-Distribution (OOD) abstention mechanism. If the epistemic uncertainty $u = \frac{K}{\alpha_0}$ remains above a critical safety threshold even after full graph traversal, the network recognizes its own lack of extracted evidence and abstains from prediction, which supports safer deployment in high-stakes environments.
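The prediction and abstention rules above reduce to a few arithmetic operations on the concentration vector. The following sketch is illustrative: the threshold name `u_max` and its value are assumptions, and the concentration vectors are made up.

```python
def predictive_mean(alpha):
    """E[p_k] = alpha_k / sum_i alpha_i."""
    total = sum(alpha)
    return [a / total for a in alpha]

def vacuity(alpha):
    """Epistemic uncertainty u = K / alpha_0, with alpha_0 the total precision."""
    return len(alpha) / sum(alpha)

def decide(alpha, u_max):
    """Return the argmax class, or None to abstain when u exceeds u_max."""
    if vacuity(alpha) > u_max:
        return None
    mean = predictive_mean(alpha)
    return max(range(len(mean)), key=mean.__getitem__)

prior = [1.0] * 10                 # no evidence gathered: u = 10 / 10 = 1
confident = [40.0, 1.0, 1.0]       # strong evidence for class 0: u = 3 / 42
abstained = decide(prior, u_max=0.5)
predicted = decide(confident, u_max=0.5)
```

An input that yields no evidence leaves the belief at the prior, where $u = 1$ is the largest value the measure can take, so it is rejected under any threshold below 1.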
### 4\.9Training the NBSR Tree
We present the end\-to\-end training process in Algo\.[1](https://arxiv.org/html/2605.26147#alg1), which explicitly details both the forward inference trajectory and the backward gradient flow, demonstrating how conditional execution dynamically restricts the computational cost\. Importantly, the backward pass highlights that the tree organically prunes its gradient updates: only the routers and experts along the actively sampled trajectory𝒯\\mathcal\{T\}receive learning signals for any given instance\.
Algorithm 1: End-to-End Training of Neural Bayesian Sequential Routing Tree
Require: Training dataset $\mathcal{D}$, initialized parameters $\Theta = \{\phi, \theta, \mathbf{W}, \mathbf{b}\}$, learning rate $\eta_{\text{lr}}$ (subscripted here to distinguish it from the confidence threshold $\eta$)
Require: Max depth $L$, branching factor $\bar{W}$ (action space), confidence threshold $\eta$, entropy penalty $\lambda$
- while not converged do
    - Sample a minibatch $\mathcal{B} \subset \mathcal{D}$
    - Initialize batch loss $\mathcal{R} \leftarrow 0$
    - for each instance $(x, y^{\ast}) \in \mathcal{B}$ do
        - // 1. Forward Pass & Trajectory Sampling
        - $t \leftarrow 0$, $v_t \leftarrow v_0$ (Root), $\bm{\alpha}_0 \leftarrow \mathbf{1} \in \mathbb{R}^K$
        - $\mathcal{T}^{(x)} \leftarrow \{v_0\}$
        - $\mathbf{h}_x \leftarrow f_\phi(x)$ {Global Knowledge Oracle}
        - while $t < L$ do
            - if $\mathcal{H}(\bm{\alpha}_t) < \eta$ then break {Uncertainty sufficiently resolved; exit early}
            - $\mathbf{z}_t \leftarrow \text{MLP}_{\theta_{v_t}}([\mathbf{h}_x \oplus \bm{\alpha}_t \oplus \bm{\varepsilon}_t])$
            - $a_t \sim \text{Gumbel-Softmax}(\mathbf{z}_t)$ {Sample discrete path via Straight-Through Estimator}
            - $v_{t+1} \leftarrow \text{TargetNode}(v_t, a_t)$
            - $\mathcal{T}^{(x)} \leftarrow \mathcal{T}^{(x)} \cup \{v_{t+1}\}$
            - $\mathbf{e}_t \leftarrow \text{Activation}(\mathbf{W}_{v_{t+1}} \mathbf{h}_x + \mathbf{b}_{v_{t+1}})$
            - $\bm{\alpha}_{t+1} \leftarrow \bm{\alpha}_t + \mathbf{e}_t$ {Bayesian Conjugate Update}
            - $t \leftarrow t + 1$
        - end while
        - $T \leftarrow t$ {Record terminal step}
        - // 2. Instance Loss Computation
        - $\mathcal{L}_{\text{NLL}}^{(x)} \leftarrow -\log\left(\frac{\alpha_{T,y^{\ast}}}{\sum_{k=1}^{K} \alpha_{T,k}}\right)$
        - $\mathcal{L}_{\text{Reg}}^{(x)} \leftarrow \lambda \sum_{i=0}^{T} \mathcal{H}(\bm{\alpha}_i)$
        - $\mathcal{R} \leftarrow \mathcal{R} + \big(\mathcal{L}_{\text{NLL}}^{(x)} + \mathcal{L}_{\text{Reg}}^{(x)}\big)$
    - end for
    - // 3. Backward Pass (Gradient Computation)
    - $\mathcal{R} \leftarrow \frac{1}{|\mathcal{B}|} \mathcal{R}$ {Average empirical risk over minibatch}
    - Compute gradients $\nabla_{\Theta} \mathcal{R}$ via backpropagation
    - {Note: for any instance $x$, $\nabla_{\mathbf{W}_v} \mathcal{L}(x, y^{\ast}) = 0$ and $\nabla_{\theta_v} \mathcal{L}(x, y^{\ast}) = 0$ for all nodes $v \notin \mathcal{T}^{(x)}$, so an instance contributes to $\nabla_{\Theta} \mathcal{R}$ only through the nodes it visits}
    - // 4. Parameter Update
    - $\Theta \leftarrow \text{OptimizerStep}(\Theta, \nabla_{\Theta} \mathcal{R}, \eta_{\text{lr}})$
- end while
- Return: Optimized parameters $\Theta^{\ast}$
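The forward-pass portion of Algorithm 1 for a single instance can be sketched as a short loop. This is a simplified illustration under several assumptions: the router is replaced by a deterministic callable (training would instead sample via Gumbel-Softmax), the experts are replaced by precomputed evidence vectors rather than functions of $\mathbf{h}_x$, and the tree, node names, and threshold values are made up. The digamma routine is a simple approximation written for this example.

```python
import math

def digamma(x):
    """psi(x) for x > 0, via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def dirichlet_entropy(alpha):
    """Differential entropy of Dir(alpha)."""
    a0, k = sum(alpha), len(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(a0)
    return log_beta + (a0 - k) * digamma(a0) - sum((a - 1.0) * digamma(a) for a in alpha)

def traverse(alpha_0, root, children, evidence, route, max_depth, eta):
    """Route, accumulate evidence, and exit early once entropy drops below eta.

    children[v] -> list of child ids; evidence[v] -> evidence vector at node v;
    route(v, alpha) -> index of the chosen child (stand-in for the router).
    """
    node, path, alphas = root, [root], [list(alpha_0)]
    for _ in range(max_depth):
        if dirichlet_entropy(alphas[-1]) < eta or not children.get(node):
            break                                  # early exit, or leaf reached
        node = children[node][route(node, alphas[-1])]
        path.append(node)
        alphas.append([a + e for a, e in zip(alphas[-1], evidence[node])])
    return path, alphas

# Depth-2 toy tree with K = 3 outcomes (all values assumed for illustration).
children = {"v0": ["v1", "v1'"], "v1": ["v2", "v2'"]}
evidence = {"v1": [4.0, 0.5, 0.5], "v1'": [0.5, 4.0, 0.5],
            "v2": [20.0, 0.1, 0.1], "v2'": [0.1, 0.1, 20.0]}
always_first = lambda node, alpha: 0

early_path, early_alphas = traverse([1.0] * 3, "v0", children, evidence,
                                    always_first, max_depth=2, eta=-1.0)
full_path, full_alphas = traverse([1.0] * 3, "v0", children, evidence,
                                  always_first, max_depth=2, eta=-100.0)
```

With the looser threshold the traversal stops after one expert, because the entropy of the first updated belief already falls below it; with the stricter threshold the full depth-2 path is executed and the unvisited siblings are never evaluated.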
##### Computational Complexity
The resource\-rationality of the NBSR framework stems directly from its decoupling of the structural graph capacity \(which is exponentially large\) from the active computational path \(which is strictly linear and dynamically truncated\)\. We formalize this across time and space domains:
- Inference Time Complexity (FLOPs): let $\mathcal{O}(C_{\text{backbone}})$ denote the complexity of the global feature extractor. The routing tree has maximum depth $L$ and branching factor $\bar{W}$. Because of the hard-routing argmax step during inference (or the STE during training), the input traverses exactly one path. The computational cost at any single active node consists of the Router MLP operations $\mathcal{O}(d\cdot d_h+d_h\cdot\bar{W})$ and the Expert linear projection $\mathcal{O}(d\cdot K)$, where $d$ is the oracle dimension and $d_h$ is the router's hidden size. Importantly, because of the entropy threshold $\eta$, the actual number of steps evaluated is a sample-dependent random variable $T(x)\leq L$. The expected total inference complexity is therefore:

$$\mathcal{O}\Big(C_{\text{backbone}}+\mathbb{E}[T(x)]\cdot(d\cdot d_h+d\cdot K)\Big) \tag{9}$$

where the router output term $d_h\cdot\bar{W}$ is absorbed into $d\cdot d_h$ whenever $\bar{W}\leq d$. Because $d$, $K$, and $d_h$ are typically small (e.g. $d=256$, $K=10$), the routing overhead, which scales with $\mathbb{E}[T(x)]$, is dwarfed by the backbone cost. Further, because unambiguous samples trigger $T(x)\ll L$, the model saves computation compared to static deep networks that must execute all layers for all inputs.
- Space and Parameter Complexity: a fully populated super-graph of depth $L$ contains $\mathcal{O}(\bar{W}^{L})$ discrete nodes. The total parameter footprint of the graph therefore scales exponentially: $\mathcal{O}\big(\bar{W}^{L}\cdot d(d_h+K)\big)$. While this demands sufficient GPU VRAM for storage during training, the active memory footprint during a forward pass breaks this exponential scaling. Because unvisited nodes are bypassed via conditional execution, the active activation memory scales linearly with the traversal depth, bounded by $\mathcal{O}(L\cdot(d_h+K))$. This makes the NBSR framework scalable to wide hierarchies during inference without a combinatorial explosion in activation memory.
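The cost accounting above can be made concrete with a small calculator. This is only a sketch under stated assumptions: the function names are hypothetical, costs are counted as multiply-accumulates of the router and expert layers only, and the hidden size `d_h=64` used in the usage note is an illustrative value that the text does not specify.

```python
def per_node_cost(d, d_h, K, W):
    """Multiply-accumulate count at one active node (router MLP + expert projection)."""
    router = d * d_h + d_h * W   # two router layers: d -> d_h -> W
    expert = d * K               # linear evidence projection: d -> K
    return router + expert


def expected_routing_cost(d, d_h, K, W, expected_steps):
    """Routing overhead for an input that visits expected_steps nodes on average."""
    return expected_steps * per_node_cost(d, d_h, K, W)


def full_tree_nodes(W, L):
    """Node count of a balanced tree with branching factor W and depth L (root at depth 0)."""
    return sum(W ** level for level in range(L + 1))


def active_activation_size(L, d_h, K):
    """Upper bound on activations held along one path of length L."""
    return L * (d_h + K)
```

For instance, with $d=256$, $K=10$, an assumed `d_h=64`, and $\bar{W}=2$, `per_node_cost(256, 64, 10, 2)` counts 19072 multiply-accumulates per visited node, while `full_tree_nodes(2, 3)` shows that only 15 nodes need to be stored for a depth-3 binary tree.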
##### Practical Implementation Tricks
A critical architectural consideration is whether to train the backbone parameters $\phi$ end-to-end or keep them frozen as a static feature extractor. A permanently frozen backbone offers significant computational advantages. First, it saves training compute and memory (footnote 18: training a complex backbone, e.g. ResNet-18 [[28](https://arxiv.org/html/2605.26147#bib.bib71)] or ViT [[13](https://arxiv.org/html/2605.26147#bib.bib15)], requires storing intermediate activations and computing gradients for millions of parameters during the backward pass; avoiding this allows the routing tree to be trained on smaller hardware with much larger batch sizes). Second, freezing $\phi$ improves early-stage training stability. Because Gumbel-Softmax routing is highly stochastic in early epochs, an actively updating backbone causes the feature space $\mathbf{h}_x$ to shift substantially, impeding the local experts' ability to learn reliable extraction filters. Frozen foundation models (e.g. DINOv2 [[58](https://arxiv.org/html/2605.26147#bib.bib13)] or CLIP [[64](https://arxiv.org/html/2605.26147#bib.bib14)]) provide a rich, anchored representation space for the routers to latch onto immediately, so the NBSR tree can focus on learning the routing logic rather than low-level feature extraction.
However, a strictly frozen oracle imposes a representation bottleneck [[36](https://arxiv.org/html/2605.26147#bib.bib80)]. Pre-trained features are tuned to their source tasks (e.g. general ImageNet categorization) and may lack the fine-grained granularity required by downstream experts in specialized domains (e.g. medical diagnostics). Further, freezing prevents synergistic adaptation: the tree must adapt to a fixed manifold instead of the backbone organizing its latent space to support the routing hierarchy, which can cap the achievable accuracy.
To reconcile these trade-offs, we recommend a Two-Stage Training Strategy (Backbone Warm-up). In Stage 1, the pre-trained global oracle $\phi$ is frozen, and only the routers ($\theta_{v_t}$) and experts ($\mathbf{W}_{v_t},\mathbf{b}_{v_t}$) are trained for the majority of the epochs (e.g. the first 70%). This allows the tree topology to discover stable routing paths and reliable evidence extractors without a shifting foundation. In Stage 2, the backbone $\phi$ is unfrozen, and the entire system is fine-tuned end-to-end using a significantly reduced learning rate (e.g. scaled by $10^{-1}$ or $10^{-2}$). This stage allows the backbone to adjust its feature space to serve the specialized experts that emerged during the warm-up phase, increasing capacity without sacrificing stability.
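The two-stage strategy can be expressed as a per-epoch schedule. The following is a minimal sketch, not the author's training loop: `two_stage_schedule` and its parameter names are hypothetical, the defaults (70% warm-up, a $10^{-1}$ scale) are taken from the examples in the text, and applying the reduced rate to the whole system in Stage 2 follows the description above.

```python
def two_stage_schedule(epoch, total_epochs, base_lr,
                       warmup_fraction=0.7, stage2_scale=0.1):
    """Return (train_backbone, learning_rate) for a 0-indexed epoch.

    Stage 1: backbone frozen, routers and experts trained at base_lr.
    Stage 2: backbone unfrozen, whole system fine-tuned at a reduced rate.
    """
    warmup_epochs = int(round(warmup_fraction * total_epochs))
    if epoch < warmup_epochs:
        return False, base_lr
    return True, base_lr * stage2_scale
```

In a training loop, the boolean would typically toggle whether backbone parameters receive gradients, and the returned rate would be written into the optimizer before each epoch.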
## 5 Theoretical Analysis
We provide a theoretical characterization of the NBSR framework. We analyze the structural guarantees of the evidence accumulation process, the asymptotic consistency of the learning objective, the mathematical implications of the routing hyperparameters, and the topological bias-variance trade-off induced by graph complexity.
### 5.1 Monotonicity of Precision and Variance Reduction
A primary cognitive claim of our framework is that traversing deeper into the routing graph strictly “sharpens” the model's hypotheses. We can formally guarantee this behavior regardless of the network's optimized weight values, relying solely on the strict positivity of the chosen bounding activation (e.g. Softplus or scaled Sigmoid) and the Dirichlet conjugate updates.
Let $\alpha_0^{(t)}=\sum_{k=1}^{K}\alpha_{t,k}$ denote the precision (or total concentration) of the Dirichlet belief state at step $t=0,1,\dots,T$, where $T$ denotes the terminal step of the sampled trajectory $\mathcal{T}$ for a given input (reached either at a predefined leaf node or dynamically via the entropy-based early exiting mechanism).
###### Theorem 1 (Strict Precision Monotonicity and Bounded Variance).
For any valid input $x$ and any sampled routing trajectory $\mathcal{T}$, the precision of the belief state increases strictly monotonically with tree depth: $\alpha_0^{(t+1)}>\alpha_0^{(t)}$. Consequently, the variance of the marginal probability of any class $k$ is bounded by $\mathcal{O}\left(\frac{1}{\alpha_0^{(t)}}\right)$.
###### Proof.
By definition, the initial state is $\bm{\alpha}_0=\mathbf{1}$, so $\alpha_0^{(0)}=K$. At each step $t$, the extracted evidence is computed as $\mathbf{e}_t=\text{Activation}(\mathbf{W}_{v_{t+1}}\mathbf{h}_x+\mathbf{b}_{v_{t+1}})$. Because the range of our allowable activation functions, whether the unbounded Softplus $(0,\infty)$ for terminal leaves or the bounded scaled Sigmoid $(0,C)$ for intermediate experts, is strictly positive, every extracted element satisfies $e_{t,k}>0$. The conjugate update rule $\bm{\alpha}_{t+1}=\bm{\alpha}_t+\mathbf{e}_t$ dictates that the new total precision is $\alpha_0^{(t+1)}=\alpha_0^{(t)}+\sum_k e_{t,k}$. Since $\sum_k e_{t,k}>0$, it follows that $\alpha_0^{(t+1)}>\alpha_0^{(t)}$.
Further, the variance of the $k$-th marginal (see Appendix [A](https://arxiv.org/html/2605.26147#A1)) is $\text{Var}[p_k^{(t)}]=\frac{\alpha_{t,k}(\alpha_0^{(t)}-\alpha_{t,k})}{(\alpha_0^{(t)})^{2}(\alpha_0^{(t)}+1)}$. The numerator is the product of two terms, $\alpha_{t,k}$ and $(\alpha_0^{(t)}-\alpha_{t,k})$, whose sum is fixed at $\alpha_0^{(t)}$. By the Arithmetic Mean-Geometric Mean (AM-GM) inequality, this product is maximized when the two terms are equal (i.e. when $\alpha_{t,k}=\frac{1}{2}\alpha_0^{(t)}$). This yields the upper bound $\alpha_{t,k}(\alpha_0^{(t)}-\alpha_{t,k})\leq\frac{1}{4}(\alpha_0^{(t)})^{2}$. Applying this bound to the variance expression, the variance is at most $\frac{1}{4(\alpha_0^{(t)}+1)}$. Since $\alpha_0^{(t)}$ strictly increases at every step, this upper bound strictly decreases as evidence accumulates, and it tends to zero whenever the precision grows without bound. ∎
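The two claims in the proof can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical and the evidence values are arbitrary positive numbers, not outputs of a trained expert.

```python
def conjugate_update(alpha, evidence):
    """Dirichlet conjugate update: add strictly positive pseudo-counts."""
    if any(e <= 0 for e in evidence):
        raise ValueError("evidence must be strictly positive")
    return [a + e for a, e in zip(alpha, evidence)]


def marginal_variances(alpha):
    """Variance of each marginal p_k under Dirichlet(alpha)."""
    a0 = sum(alpha)
    return [a * (a0 - a) / (a0 ** 2 * (a0 + 1)) for a in alpha]


def variance_bound(alpha):
    """AM-GM upper bound 1 / (4 * (alpha_0 + 1)) on every marginal variance."""
    return 1.0 / (4.0 * (sum(alpha) + 1.0))


# Uniform prior for K = 3, followed by two illustrative evidence vectors.
alpha = [1.0, 1.0, 1.0]
trajectory = [alpha]
for evidence in ([0.5, 2.0, 0.1], [0.2, 9.0, 0.3]):
    alpha = conjugate_update(alpha, evidence)
    trajectory.append(alpha)

precisions = [sum(a) for a in trajectory]
bounds = [variance_bound(a) for a in trajectory]
```

Here `precisions` increases at every step and `bounds` decreases accordingly; each entry of `marginal_variances(a)` stays below the corresponding bound.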
###### Corollary 1 (Epistemic Collapse).
A direct consequence of Theorem [1](https://arxiv.org/html/2605.26147#Thmtheorem1) is that the epistemic uncertainty, defined in Subjective Logic as $u=\frac{K}{\alpha_0^{(t)}}$, strictly decreases along any valid trajectory. In-distribution samples that elicit substantial evidence therefore drive $u$ below the safety threshold, whereas out-of-distribution (OOD) anomalies, which elicit near-zero evidence from the experts, leave $\alpha_0$ nearly stagnant, keeping the system in a state of high uncertainty ($u\approx 1$) and triggering abstention.
### 5.2 Asymptotic Consistency of the Belief State
The training objectives of deep neural networks are generally non-convex, meaning convergence to a global minimum cannot be guaranteed during gradient descent. However, we can analyze the asymptotic consistency of our framework under the standard assumption of infinite model capacity (the Universal Approximation Theorem [[10](https://arxiv.org/html/2605.26147#bib.bib1),[33](https://arxiv.org/html/2605.26147#bib.bib2)]).
###### Theorem 2 (Bayes Optimal Consistency).
Assume the global feature backbone $f_{\phi}(\cdot)$ and the local experts $f_v(\cdot)$ possess infinite functional capacity. If the training objective $\mathcal{L}$ (Eq. [7](https://arxiv.org/html/2605.26147#S4.E7)) achieves its global minimum, the expected probability distribution derived from the terminal Dirichlet state $\bm{\alpha}_T$ exactly recovers the true conditional data distribution. That is, $\mathbb{E}[p_k\mid\bm{\alpha}_T]=P(y=k\mid x)$.
###### Proof\.
Based on the Universal Approximation Theorem\[[10](https://arxiv.org/html/2605.26147#bib.bib1),[33](https://arxiv.org/html/2605.26147#bib.bib2)\], the assumption of infinite functional capacity is theoretically satisfied if the neural networks are constructed with either an arbitrarily large number of hidden units \(infinite width\) or an arbitrarily large number of hidden layers \(infinite depth\), paired with a non\-polynomial continuous activation function \(such as ReLU\)\. Under these conditions, the networks can approximate any continuous Borel measurable function on a compact domain to an arbitrary degree of precision\.
Setting the entropy penalty coefficient $\lambda\to 0$ for the analysis of the primary objective, the regularization term vanishes. From the properties of the Dirichlet distribution, the expected marginal probability for the target class $y^{*}$ is exactly $\mathbb{E}[p_{y^{*}}\mid\bm{\alpha}_T]=\frac{\alpha_{T,y^{*}}}{\sum_{k=1}^{K}\alpha_{T,k}}$. Substituting this into the primary loss term in Eq. [7](https://arxiv.org/html/2605.26147#S4.E7) reduces it directly to the Negative Log-Likelihood (NLL) of the model's expected prediction for a single data instance $(x,y^{*})$: $\mathcal{L}_{\text{NLL}}=-\log\mathbb{E}[p_{y^{*}}\mid\bm{\alpha}_T]$.
To demonstrate why minimizing this NLL yields the true posterior, we define the model's predicted conditional distribution for any class $y$ given input $x$ as $q(y\mid x)=\mathbb{E}[p_y\mid\bm{\alpha}_T(x)]$. The instance-wise NLL is therefore simply $-\log q(y^{*}\mid x)$. During end-to-end training, minimizing the empirical risk over a dataset corresponds to minimizing the expected NLL over the true joint data distribution $P_{\text{data}}(x,y)$. This expected loss is mathematically equivalent to the cross-entropy between the true data distribution and the model's predicted distribution, which can be decomposed into the Shannon entropy of the data and the Kullback-Leibler (KL) divergence:

$$\begin{aligned}
\mathbb{E}_{x,y\sim P_{\text{data}}}\left[\mathcal{L}_{\text{NLL}}\right] &= \mathbb{E}_{x,y\sim P_{\text{data}}}\left[-\log q(y\mid x)\right] \\
&= \mathbb{E}_{x}\left[\sum_{k=1}^{K}P(y=k\mid x)\big(-\log q(y=k\mid x)\big)\right] \\
&= \mathbb{E}_{x}\left[\mathcal{H}(P(y\mid x))+D_{\text{KL}}(P(y\mid x)\parallel q(y\mid x))\right]
\end{aligned}$$

Because the true data entropy $\mathcal{H}(P(y\mid x))$ is an irreducible constant with respect to the network parameters, minimizing the expected training objective is mathematically equivalent to minimizing the KL divergence. A fundamental property of KL divergence is Gibbs' inequality, which states $D_{\text{KL}}(P\parallel q)\geq 0$, with equality holding if and only if $P(y\mid x)=q(y\mid x)$ almost everywhere.
Since we assume the hypothesis class has infinite functional capacity, the network is not constrained by approximation error. Therefore, achieving the global minimum of the loss forces $D_{\text{KL}}\to 0$, which consequently adjusts the accumulated evidence $\sum_{t=0}^{T-1}\mathbf{e}_t$ such that the final Dirichlet expectation converges to the true posterior: $\frac{\alpha_{T,k}}{\sum_{i}\alpha_{T,i}}\to P(y=k\mid x)$. ∎
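The decomposition used in the proof, cross-entropy equals entropy plus KL divergence, can be verified numerically for a single input. This is a small sketch with hypothetical helper names and made-up distributions; `q` is taken to be the Dirichlet mean, as in the definition of $q(y\mid x)$ above.

```python
import math


def entropy(p):
    """Shannon entropy of a categorical distribution (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) for categorical distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p, q):
    """Expected NLL of predictions q under true class probabilities p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def dirichlet_mean(alpha):
    """Expected class probabilities under Dirichlet(alpha)."""
    a0 = sum(alpha)
    return [a / a0 for a in alpha]


p_true = [0.7, 0.2, 0.1]                        # illustrative P(y | x)
q_matched = dirichlet_mean([70.0, 20.0, 10.0])  # Dirichlet mean equal to p_true
q_uniform = dirichlet_mean([1.0, 1.0, 1.0])     # uninformed prior
```

The expected NLL under `q_uniform` exceeds that under `q_matched` by exactly the KL term, and the KL term vanishes when the Dirichlet mean matches `p_true`. Note that only the mean is pinned down: any rescaling of the matched $\bm{\alpha}_T$ gives the same expected NLL.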
### 5.3 Hyperparameter Dynamics and Information Acquisition
The practical behavior of the sequential router is governed by two key hyperparameters: the early-exiting threshold $\eta$ and the entropy penalty coefficient $\lambda$.
###### Proposition 1 ($\lambda$ as an Information Acquisition Accelerator).
The penalty term $\lambda\sum_{v_t\in\mathcal{T}}\mathcal{H}(\bm{\alpha}_t)$ in Eq. [7](https://arxiv.org/html/2605.26147#S4.E7) acts as a temporal regularizer on information acquisition. Because the differential entropy $\mathcal{H}(\bm{\alpha})$ decreases as the magnitude of $\bm{\alpha}$ increases, minimizing this sum forces the network to reduce entropy as early as possible in the trajectory.
Therefore, increasing $\lambda$ incentivizes the early-stage experts to extract larger, more discriminative evidence vectors $\mathbf{e}_t$. This results in a steeper descent in the entropy curve.
Conversely, the inference threshold $\eta$ explicitly bounds the maximum tolerated uncertainty. If the user sets a highly restrictive (low) $\eta$, the model is forced to route deeper into the tree, aggregating more evidence vectors until $\mathcal{H}(\bm{\alpha}_t)<\eta$. Together, $\lambda$ and $\eta$ allow for tunable control over the accuracy-computation trade-off: $\lambda$ trains the experts to act decisively, while $\eta$ ensures the system does not act prematurely.
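The role of $\eta$ can be sketched as an early-exit loop over a fixed path. This is an illustrative sketch, not the paper's implementation: the function names are hypothetical, the evidence vectors are supplied directly instead of being produced by experts, and `digamma` is a hand-rolled approximation (upward recurrence plus an asymptotic series) because the Python standard library does not provide one. Since $\mathcal{H}$ is a differential entropy it can be negative; for the uniform prior with $K=3$ it equals $-\log 2$.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0: recurrence up to x >= 6, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def dirichlet_entropy(alpha):
    """Differential entropy of Dirichlet(alpha)."""
    a0 = sum(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(a0)
    return (log_beta
            + (a0 - len(alpha)) * digamma(a0)
            - sum((a - 1.0) * digamma(a) for a in alpha))


def route_with_early_exit(evidence_per_step, eta):
    """Accumulate evidence until H(alpha_t) < eta or the path ends.

    Returns the final belief state and the number of steps taken, T(x).
    """
    alpha = [1.0] * len(evidence_per_step[0])  # uniform prior
    steps = 0
    for evidence in evidence_per_step:
        if dirichlet_entropy(alpha) < eta:
            break  # belief is already sharp enough: exit early
        alpha = [a + e for a, e in zip(alpha, evidence)]
        steps += 1
    return alpha, steps
```

Lowering `eta` makes the exit condition harder to satisfy, so the same input traverses more nodes; raising it lets confident inputs stop sooner.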
### 5.4 Topological Bias-Variance Trade-off
The structural configuration of the pre-specified super-graph induces a classic bias-variance trade-off. To formalize this, consider the expected NLL risk of our model over a target conditional distribution $P^{*}(y|x)$. Let $P_{\mathcal{G}}(y|x;\mathcal{D})$ denote the predictive distribution produced by our graph $\mathcal{G}$ trained on dataset $\mathcal{D}$, and let $\bar{P}_{\mathcal{G}}(y|x)=\mathbb{E}_{\mathcal{D}}[P_{\mathcal{G}}(y|x;\mathcal{D})]$ be the expected prediction over all possible datasets. Following the standard generalized bias-variance decomposition for cross-entropy loss, the expected risk can be expressed as (for details see Appendix [D](https://arxiv.org/html/2605.26147#A4)):

$$\mathbb{E}_{\mathcal{D}}\Big[\mathbb{E}_{y\sim P^{*}(\cdot|x)}\big[-\log P_{\mathcal{G}}(y|x;\mathcal{D})\big]\Big]\approx\underbrace{\mathcal{H}(P^{*})}_{\text{Irreducible Noise}}+\underbrace{D_{\text{KL}}\big(P^{*}\|\bar{P}_{\mathcal{G}}\big)}_{\text{Bias}}+\underbrace{\mathbb{E}_{\mathcal{D}}\big[D_{\text{KL}}\big(\bar{P}_{\mathcal{G}}\|P_{\mathcal{G}}(\cdot;\mathcal{D})\big)\big]}_{\text{Variance}} \tag{10}$$

where $\mathcal{H}$ is the Shannon entropy and $D_{\text{KL}}$ denotes the Kullback-Leibler divergence. The dimensions of our routing tree, namely its maximal depth $L$ and its average branching width $\bar{W}$, govern the tension between the Bias and Variance terms:
- Width (Branching Factor): increasing the number of outgoing edges $\bar{W}=\mathbb{E}[|\mathcal{A}(v_t)|]$ allows for a finer partition of the input feature space. This high expressivity reduces the Bias term, as the model can dedicate highly specialized local experts to distinct sub-populations of the data. However, excessive width leads to severe data fragmentation. Assuming a balanced tree, the expected number of training samples reaching a node at depth $l$ scales proportionally to $|\mathcal{D}|/\bar{W}^{l}$. As the sample size per node plummets, the local parameter estimates ($\mathbf{W}_v,\mathbf{b}_v$) become highly sensitive to noise in the specific training set $\mathcal{D}$, drastically increasing the Variance term and risking localized overfitting.
- Depth (Sequential Steps): deeper graphs (large $L$) allow for prolonged evidence accumulation and complex hierarchical reasoning, further reducing the Bias by approximating highly non-linear decision boundaries. Yet excessive depth exponentially exacerbates the Variance through two mechanisms: compounding routing stochasticity along the trajectory, and forcing the fragmentation of data across $\bar{W}^{L}$ terminal leaf nodes.
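The fragmentation argument in both bullets comes down to simple arithmetic. The sketch below uses hypothetical function names and assumes a perfectly balanced tree with uniform routing, which is the idealization stated in the text; the dataset size in the usage note is an illustrative value.

```python
def expected_samples_per_node(dataset_size, branching, depth):
    """Expected training samples reaching one node at the given depth: |D| / W^l."""
    return dataset_size / (branching ** depth)


def terminal_leaves(branching, max_depth):
    """Number of leaf nodes W^L in a balanced tree of depth L."""
    return branching ** max_depth
```

With 50,000 training samples, a binary tree leaves `expected_samples_per_node(50000, 2, 3)` = 6250 samples per depth-3 node, whereas a branching factor of 4 leaves only 781.25, spread across `terminal_leaves(4, 3)` = 64 leaves.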
Adaptive Regularization via the Belief State: if the tree topology were fixed and fully traversed for every input, minimizing the expected risk would require an arduous manual search for the optimal hyper-parameters $(L,\bar{W})$. Our framework circumvents this via the dynamically updated Dirichlet belief state $\bm{\alpha}_t$.
By imposing an intermediate entropy penalty $\mathcal{H}(\bm{\alpha}_t)$ and using the early-exiting threshold $\eta$, the network replaces the global structural constants $L$ and $\bar{W}$ with sample-dependent variables $L(x)$ and $W(x)$. Easy, unambiguous samples are routed through shallow, highly populated paths ($L(x)$ is small, maximizing local data density and minimizing variance), while complex samples are permitted to traverse deeper, specialized sub-graphs ($L(x)$ is large, minimizing bias). Consequently, the framework calibrates the bias-variance trade-off on a per-instance basis, increasing representational capacity without sacrificing estimator stability.
## 6 Experiments
(Footnote 19: the experimental Python code was developed largely with the kind assistance of Gemini 3.0 [[24](https://arxiv.org/html/2605.26147#bib.bib52)], which the author gratefully acknowledges.)
To empirically validate the NBSR framework, we design a diverse suite of experiments\. We evaluate computational efficiency and structural emergence on a visual benchmark \(CIFAR\-10\), test interpretability and resource\-rationality in a structured medical diagnostic domain, and further extend the framework to language modeling, autonomous control \(POMDPs\), and active clinical triage\.
### 6.1 A Toy Experiment: Sequential Belief Sharpening
Before evaluating NBSR on high-dimensional benchmarks, we first present a toy experiment to visually illustrate the theoretical guarantees of Theorem [1](https://arxiv.org/html/2605.26147#Thmtheorem1) (Strict Precision Monotonicity and Bounded Variance). We use the classic Iris dataset, deliberately restricting the input space to two continuous dimensions (Sepal Length and Sepal Width) to enable a direct, interpretable plot of the 2D decision regions and their associated epistemic uncertainties.
##### Experimental setup.
We instantiate a miniature NBSR tree with a maximum depth of 2. The outcome space consists of $K=3$ distinct flower species. To simultaneously visualize both the categorical prediction and the model's confidence, we map the three outcome classes to distinct RGB color channels (e.g. Red, Green, Blue). For any given coordinate in the 2D feature space, the base color hue is determined by the Bayesian expected marginals $\mathbb{E}[p_k]$, while the color saturation (intensity) is scaled proportionally to the total Dirichlet precision $\alpha_0^{(t)}$.
Consequently, regions of high epistemic uncertainty (where the model lacks extracted evidence and $u=\frac{K}{\alpha_0}$ is high) appear pale, faded, or white. Conversely, regions of high confidence (where a large total precision $\alpha_0$ has accumulated) appear as deep, vivid, solid colors.
##### Architectural design\.
To perfectly isolate and visualize the mechanics of sequential Bayesian evidence accumulation, we implement two specific architectural simplifications for this toy scenario:
1. Single Active Trajectory: Rather than instantiating a full super-graph with a stochastic routing policy ($\pi_{\theta}$), we hardcode a single, pre-determined active path (Root $\to$ Mid-Expert $\to$ Leaf-Expert). This removes the discrete argmax routing logic, allowing us to visualize the continuous geometric updates down a single logical branch.
2. Bounded vs. Unbounded Capacity: To accurately simulate a hierarchical taxonomy, intermediate nodes must act as broad super-categories, while leaf nodes act as specific specialists. To mathematically enforce this capacity constraint, we restrict the mid-level expert using a scaled Sigmoid activation to upper-bound the maximum evidence it can extract. The terminal leaf expert retains the standard unbounded Softplus activation. (Footnote 20: here the Sigmoid function is used to bound the maximum scalar evidence the expert can extract, effectively serving as a strict epistemic confidence budget rather than acting in its traditional capacity as a simple non-linear activation function.)

Importantly, both experts query the exact same Persistent Global Knowledge Oracle ($\mathbf{h}_x$). In standard feed-forward networks, features are passed and transformed sequentially layer-by-layer (e.g. $h_1\to h_2$). In contrast, NBSR broadcasts the shared global oracle $\mathbf{h}_x$ to all levels of the DAG. It is entirely up to the individual experts to use their own specialized weights ($\mathbf{W}_v$ in Eq. [5](https://arxiv.org/html/2605.26147#S4.E5)) to filter and extract the specific knowledge they require directly from this shared state.
The computational flow of the simulated active trajectory unfolds as follows:
- Depth 0 (Initialization): The belief state is initialized to the uniform prior, $\bm{\alpha}_0=[1,1,1]$, representing maximum entropy.
- Depth 1 (Intermediate Node): The mid-level expert queries $\mathbf{h}_x$ and extracts a bounded evidence vector $\mathbf{e}_0$. The belief updates via exact conjugate addition: $\bm{\alpha}_1=\bm{\alpha}_0+\mathbf{e}_0$.
- Depth 2 (Terminal Node): The leaf expert queries the exact same $\mathbf{h}_x$ and extracts an unbounded evidence vector $\mathbf{e}_1$. The final belief updates: $\bm{\alpha}_2=\bm{\alpha}_1+\mathbf{e}_1$.
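The three-step flow above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the weights `MID` and `LEAF`, the cap `cap=2.0`, and the input `H_X` are hypothetical values chosen by hand, and the raw two-dimensional input stands in for the oracle $\mathbf{h}_x$; none of these come from the trained toy model.

```python
import math


def softplus(z):
    """Numerically stable Softplus, range (0, inf)."""
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)


def scaled_sigmoid(z, cap):
    """Sigmoid scaled to the range (0, cap): a bounded evidence budget."""
    return cap / (1.0 + math.exp(-z))


def linear(weights, bias, h):
    """Compute W h + b for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, bias)]


def toy_trajectory(h, mid, leaf, cap=2.0):
    """Root -> Mid-Expert -> Leaf-Expert; both experts read the same h."""
    alpha0 = [1.0, 1.0, 1.0]                                   # uniform prior
    e0 = [scaled_sigmoid(z, cap) for z in linear(mid[0], mid[1], h)]
    alpha1 = [a + e for a, e in zip(alpha0, e0)]               # bounded update
    e1 = [softplus(z) for z in linear(leaf[0], leaf[1], h)]
    alpha2 = [a + e for a, e in zip(alpha1, e1)]               # unbounded update
    return alpha0, alpha1, alpha2


# Hypothetical parameters and input (sepal length, sepal width).
H_X = [5.1, 3.5]
MID = ([[-1.0, 1.5], [0.5, -0.5], [0.8, -1.2]], [0.0, 0.0, 0.0])
LEAF = ([[-2.0, 4.0], [1.0, -1.5], [1.5, -3.0]], [0.0, 0.0, 0.0])

states = toy_trajectory(H_X, MID, LEAF)
uncertainty = [3.0 / sum(alpha) for alpha in states]  # u = K / alpha_0
```

The mid-level update can add at most `cap` per class, so `uncertainty` drops only modestly at depth 1; the leaf update has no such ceiling.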
##### Results: Layer\-by\-Layer Variance Collapse\.
By evaluating the 2D test space at each discrete timestep $t$ and plotting the intermediate Dirichlet belief states, we observe the spatial manifestation of the Bayesian “sharpening” effect (Fig. [2](https://arxiv.org/html/2605.26147#S6.F2)):
- $t=1$ (Mid-Level Routing): The bounded mid-level expert extracts the initial evidence vector $\mathbf{e}_0$. The resulting intermediate posterior $\bm{\alpha}_1$ yields a roughly discernible classification boundary. However, because the accumulated precision $\alpha_0^{(1)}$ is mathematically constrained, the decision regions exhibit high variance and are visually pale and faded (blended with white). The model has established a directional hypothesis but remains appropriately hesitant.
- $t=2$ (Terminal Leaf Expert): The unbounded leaf expert queries the oracle, extracting and conjugately adding $\mathbf{e}_1$. As established in Theorem [1](https://arxiv.org/html/2605.26147#Thmtheorem1), the total precision $\alpha_0^{(2)}$ strictly increases. Visually, the variance collapses: the previously faded regions transform into deep, highly saturated RGB colors (each color represents a decision class). The transition zones between the three classes shrink from broad, blurry gradients into sharp, highly confident decision boundaries.

Figure 2: Evolution of the 2D decision boundary and epistemic uncertainty on the Iris dataset. Color saturation reflects the total Dirichlet precision $\alpha_0^{(t)}$. (Left) At Depth 1, the bounded intermediate expert extracts limited initial evidence, forming a broad but highly uncertain (faded) hypothesis. (Right) At Depth 2, the unbounded terminal expert injects substantial evidence, driving precision upward and causing the variance to collapse into a highly confident, sharply defined decision manifold with deep colors.

This toy experiment illustrates that the NBSR framework does not merely shift probability mass between classes, but actively increases the volume of evidence to shrink epistemic variance layer by layer.
### 6.2 Visual Categorization: CIFAR-10
#### 6.2.1 Experimental Setup and Baselines
We evaluate our sequential routing framework using a truncated ResNet-18 backbone configured under a Predefined Taxonomy graph structure. This human-logical hierarchy organizes the 10 CIFAR-10 classes by splitting them into semantic super-categories. At Depth 1, the graph routes between Animals (6 classes) and Vehicles (4 classes). At Depth 2, the graph allocates 5 specialized leaf experts to handle specific sub-populations: Pets (cat, dog), Wildlife (deer, horse, frog), Birds (bird), AirWater (airplane, ship), and Road (automobile, truck). The maximal graph topology is structured as follows:
```
[Level 0: Root, Width: 1]                     Root Router
                                               /         \
                                           /                \
[Level 1: Mid, Width: 2]            Animal                   Vehicle
                                   /  |   \                   /   \
                                /     |      \              /       \
[Level 2: Leaf, Width: 5]   Pets   Wildlife   Birds     AirWater   Road
```
##### Decoupling Graph Topology from the Outcome Space
A common misconception carried over from classical hierarchical decision trees is that the terminal leaf nodes (baskets) correspond directly to the final classes (e.g. 5 leaves equate to 5 classes), or that local experts output static, class-agnostic templates. In the NBSR framework, the graph topology is fully decoupled from the $K$-dimensional outcome space. Each leaf node represents an Expert (a specialized neural network module), and every expert, regardless of its position in the tree, computes a full 10-dimensional evidence vector. Thus each node, intermediate or terminal, is a decision-maker that produces a belief over all outcomes at a different level of fidelity: each is trained to be a specialized consultant for one or more classes, but not all.
This means the number of leaf nodes (5) does not dictate the number of output classes (10). Instead, the 10 CIFAR-10 classes are distributed across the 5 leaf-node experts through the following learned specialization mapping:
- Animals (6 Classes) $\rightarrow$ 3 Experts: Pets (Cat, Dog), Wildlife (Deer, Horse, Frog), and Birds (Bird).
- Vehicles (4 Classes) $\rightarrow$ 2 Experts: AirWater (Airplane, Ship) and Road (Automobile, Truck).

Even though an expert like Pets specializes in a specific sub-population (cats and dogs), its output remains a full 10-dimensional vector. For example, consider an image of a dog routed down the predefined taxonomy. The root router identifies the input as an animal, and the subsequent router forwards it to the Pets expert. The Pets expert does not simply apply a static “+1 to Pets” update; rather, it acts as a dynamic function applied to the specific image's global features:
1. The Oracle Holds Specifics: the global knowledge oracle $\mathbf{h}_x$ encodes highly specific, high-dimensional visual concepts (e.g. floppy ears, a long snout, fur texture) rather than mere abstract categorical tags.
2. Weights as Distinct Filters: the expert calculates the evidence vector (Eq. [5](https://arxiv.org/html/2605.26147#S4.E5)) $\mathbf{e}_{\text{pets}}=\text{Softplus}(\mathbf{W}_{\text{pets}}\mathbf{h}_x+\mathbf{b}_{\text{pets}})$. Importantly, $\mathbf{W}_{\text{pets}}\in\mathbb{R}^{10\times d}$ contains a distinct row vector for each of the 10 classes. The row vector $\mathbf{w}_{\text{dog}}$ learns to align with dog-specific features, while $\mathbf{w}_{\text{cat}}$ aligns with cat-specific features.
3. Asymmetric Evidence Extraction: the inner products between $\mathbf{h}_x$ and the rows of $\mathbf{W}_{\text{pets}}$ yield a highly asymmetric 10-dimensional evidence vector. The dog dimension receives a large positive scalar, the cat dimension receives a minor scalar (due to shared “furry” features), and evidence for unrelated classes (e.g. airplane and ship) is pushed to near-zero values.
Thus, the router’s role is simply to delegate the computational flow to the expert equipped with the most highly\-trained filters for that specific data sub\-population\. The chosen expert then independently interrogates the raw data to sharpen the Dirichlet distribution around a single specific class, preventing probability smearing across the broader sub\-category\.
##### Training Details & Baselines
To ensure a fair comparison, all models in this experiment share an identical, unfrozen ImageNet-pretrained ResNet-18 backbone [[28](https://arxiv.org/html/2605.26147#bib.bib71)], an identical spatial and pixel-level augmentation pipeline, and the same deterministic initialization seeds. (Footnote 21: specifically, we initialize the network using weights pre-trained on ImageNet-1K [[11](https://arxiv.org/html/2605.26147#bib.bib25)]. This provides the backbone with rich, universal visual representations (e.g. edges and textures) prior to fine-tuning on CIFAR-10. Importantly, all backbone parameters remain unfrozen during our training process, allowing the optimizer to fully adapt these generalized features to our specific classification task.) Following Appendix [C](https://arxiv.org/html/2605.26147#A3), the NBSR framework employs a Gumbel-Softmax temperature annealing schedule, decaying $\tau$ from $1.0$ to $0.1$ via a $0.97$ epoch-wise rate over a 150-epoch training regime. We use the Adam optimizer [[47](https://arxiv.org/html/2605.26147#bib.bib24)] with a learning rate of $2\times 10^{-4}$ and a StepLR scheduler (step size 45, $\gamma=0.5$).
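The two schedules described above can be written down as closed-form functions of the epoch. This is a sketch under assumptions: the function names are hypothetical, and clamping $\tau$ at $0.1$ once the geometric decay reaches it is one plausible reading of "decaying from 1.0 to 0.1"; the exact schedule is the one given in Appendix C.

```python
def gumbel_temperature(epoch, tau_start=1.0, tau_min=0.1, decay=0.97):
    """Geometric epoch-wise decay of the Gumbel-Softmax temperature, floored at tau_min."""
    return max(tau_min, tau_start * decay ** epoch)


def step_lr(epoch, base_lr=2e-4, step_size=45, gamma=0.5):
    """StepLR-style schedule: multiply the rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)
```

Under this reading, the temperature reaches its floor at epoch 76 (since $0.97^{76}<0.1<0.97^{75}$) and stays there for the remaining epochs, while the learning rate is halved at epochs 45, 90, and 135.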
We benchmark against two controlled reference architectures:
- Flat ResNet-18: we modify the standard architecture by replacing its original ImageNet classification head with a 10-class linear layer. This serves to measure the baseline representation capacity of a standard, dense forward-pass network on this dataset. (Footnote 22: since the original ResNet-18 was designed for ImageNet (1000 classes), the only modification made for this baseline was swapping its final fully-connected layer to match the CIFAR-10 classification task.)
- Sparse MoE (Soft Routing): a standard Mixture-of-Experts architecture. To match NBSR's structural capacity, it employs exactly 5 parallel experts. It uses continuous Top-2 softmax gating (soft routing) and incorporates a standard load-balancing auxiliary loss ($\lambda_{aux}=0.1$) to discourage representation collapse. (Footnote 23: in Top-2 gating, a routing network computes a softmax probability distribution over all available experts. For each input, only the two experts with the highest probabilities are executed. Their outputs are then aggregated via a weighted sum using their renormalized routing probabilities, achieving sparsity while remaining differentiable through the selected gate weights.) (Footnote 24: sparse MoE networks are highly susceptible to routing collapse, often termed the “dead expert problem”, a self-reinforcing failure mode where the gate disproportionately routes inputs to a single favored expert while the others starve for gradients and fail to learn [[70](https://arxiv.org/html/2605.26147#bib.bib22),[16](https://arxiv.org/html/2605.26147#bib.bib26)]. The standard load-balancing auxiliary loss counteracts this by penalizing the variance in expert utilization across a training batch, pushing the model to distribute inputs uniformly.)
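The Top-2 gating used by the Sparse MoE baseline can be sketched as follows. This is an illustration of the mechanism described in footnote 23, not the baseline's actual code: the function names are hypothetical, and for brevity all expert outputs are passed in precomputed, whereas a real sparse MoE would execute only the two selected experts.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_gate(logits):
    """Return {expert_index: renormalized weight} for the two most probable experts."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def moe_output(logits, expert_outputs):
    """Weighted sum of the two selected experts' output vectors."""
    gates = top2_gate(logits)
    dim = len(expert_outputs[0])
    return [sum(w * expert_outputs[i][j] for i, w in gates.items())
            for j in range(dim)]
```

Unlike the hard single-path routing in NBSR, every input here blends two experts, so the output is always a convex combination of their predictions.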
#### 6.2.2 Results and Analysis
##### Overall Performance and Calibration.
Table [2](https://arxiv.org/html/2605.26147#S6.T2) summarizes the performance comparison of our proposed NBSR framework against the strictly controlled Flat ResNet-18 and Sparse MoE baselines. Overall, NBSR achieves the highest peak accuracy (96.74%) while operating at a computational speed (53.5s/epoch training, 5.1s/epoch inference) virtually identical to the completely unrouted Flat ResNet-18. The soft-routing Sparse MoE yields the lowest accuracy (96.43%) and the highest computational overhead, which is consistent with our hypothesis that forcing standard continuous soft-routing onto semantic tasks induces destructive optimization friction and representation collapse.
Further, our framework demonstrates superior predictive reliability. By evaluating our Bayesian expected marginals $\mathbb{E}[p_k]$ ([cf. Eq. 17](https://arxiv.org/html/2605.26147#S3.Ex3)) on the in-distribution test set, NBSR achieves an Expected Calibration Error (ECE; its definition and calculation can be found in Appendix [F](https://arxiv.org/html/2605.26147#A6)) of 0.015. This is a substantial improvement in uncertainty calibration over both the standard Flat ResNet (0.027) and the Sparse MoE (0.028). The training dynamics of NBSR are presented in Fig. [3](https://arxiv.org/html/2605.26147#S6.F3); for completeness, the raw training loss and test accuracy dynamics for both reference baselines are provided in Appendix [F](https://arxiv.org/html/2605.26147#A6).
Table 2: Empirical comparison of NBSR against two controlled baselines on CIFAR-10. Note: All models share an identical backbone, seed, and data pipeline. Times represent average per-epoch hardware execution speed.
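For readers who want a concrete picture of the calibration metric, the sketch below computes ECE from Dirichlet expected marginals $\mathbb{E}[p_k]=\alpha_k/\alpha_0$. It assumes the common equal-width binning of the top-class confidence; the number of bins and the function names are illustrative choices, and the exact procedure used in the paper is the one defined in its Appendix F.

```python
def dirichlet_mean(alpha):
    """Expected class probabilities E[p_k] = alpha_k / alpha_0."""
    a0 = sum(alpha)
    return [a / a0 for a in alpha]


def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE with equal-width confidence bins (assumed binning scheme).

    confidences: top-class probability per sample, in [0, 1].
    correct:     bool per sample, True if the top class was the true label.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

A typical use would take `max(dirichlet_mean(alpha))` as the confidence for each test image and compare the predicted class with the label.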
##### Information Acquisition vs\. Semantic Accuracy\.
The hyperparameter $\lambda$ in the training objective Eq. [7](https://arxiv.org/html/2605.26147#S4.E7) governs a critical trade-off between the model’s semantic accuracy and its rate of information acquisition. In practice, the Negative Log-Likelihood (NLL) remains the primary driver of feature discovery during early training. However, as the model converges and the NLL approaches its theoretical lower bound (asymptotic to $0.001$ to $0.002$ by Epoch 100 in our tests), the entropy penalty $\mathcal{H}(\cdot)$ becomes the numerically dominant component of the loss function. This stage represents a distinct transition from feature learning to belief sharpening.
As illustrated in Fig. [3](https://arxiv.org/html/2605.26147#S6.F3), the weighted structural pressure ($|\lambda\cdot\mathcal{H}|$) eventually overtakes the semantic loss and steadily climbs, crystallizing the model’s confidence. In our experiments, a value of $\lambda=10^{-3}$ provided a stable “Bayesian nudge” that avoided the representation collapse seen in the MoE baseline and allowed the NBSR model to reach the highest peak accuracy among the compared models, 96.74%.
Figure 3: Evolution of the two loss components over 150 epochs: the semantic NLL strictly governs initial feature learning, while the structural entropy penalty becomes dominant during late-stage convergence to enforce belief sharpening and asymptotic calibration.
##### Precision Monotonicity and Entropy Sharpening\.
We validate Theorem [1](https://arxiv.org/html/2605.26147#Thmtheorem1) by recording the trajectory of the Dirichlet parameters across the discrete tree depths. To evaluate this global behavior, we compute the total precision and differential entropy for each individual image and plot the averaged values across the entire 10,000-image CIFAR-10 test set. As shown in Fig. [4](https://arxiv.org/html/2605.26147#S6.F4), in-distribution samples exhibit a mathematically consistent “sharpening” effect. Specifically, we observe a strictly monotonic increase in the mean total concentration $\alpha_0^{(t)}$, which rises from the root prior of $10$ (Depth 0) to a massive leaf precision exceeding $6000$ (Depth 2), corresponding to a dramatic narrowing of the probability simplex. (While Subjective Logic formally defines \[[69](https://arxiv.org/html/2605.26147#bib.bib28)\] epistemic uncertainty as $u=K/\alpha_0$, we omit explicit plots of $u$ in our evaluations to avoid redundancy. Because the class count $K$ is constant, $u$ is strictly inversely proportional to the total precision. Therefore, the massive monotonic increase in $\alpha_0^{(t)}$ demonstrated here mathematically guarantees a proportional collapse in $u$. We instead rely on Expected Calibration Error (ECE) and differential entropy $\mathcal{H}(\bm{\alpha})$ as our primary empirical and operational metrics for uncertainty.)
Figure 4: Evolution of the belief state $\bm{\alpha}_t$, averaged across the CIFAR-10 test set. (Left) The monotonic increase of the mean total concentration $\alpha_0^{(t)}$ across tree depth $t\in\{0,1,2\}$ as proven in Theorem [1](https://arxiv.org/html/2605.26147#Thmtheorem1). (Right) The corresponding sequential collapse in mean differential entropy $\mathcal{H}(\bm{\alpha}_t)$.
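The quantities tracked in this experiment can be reproduced on a toy trajectory. The sketch below applies conjugate addition of strictly positive evidence and evaluates the total precision and the standard closed-form differential entropy of a Dirichlet, $\mathcal{H}(\bm{\alpha})=\ln B(\bm{\alpha})+(\alpha_0-K)\psi(\alpha_0)-\sum_k(\alpha_k-1)\psi(\alpha_k)$. The evidence vectors are invented for illustration, and the `digamma` helper is a stdlib-only series approximation introduced here so that the example has no dependencies.

```python
import math


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return result + math.log(x) - 0.5 * inv - inv2 * (
        1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def dirichlet_entropy(alpha):
    """Differential entropy of Dir(alpha); negative for sharp beliefs."""
    k, a0 = len(alpha), sum(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(a0)
    return (log_beta + (a0 - k) * digamma(a0)
            - sum((a - 1.0) * digamma(a) for a in alpha))


def conjugate_update(alpha, evidence):
    """Dirichlet-Categorical update: add strictly positive pseudo-counts."""
    if any(e <= 0 for e in evidence):
        raise ValueError("evidence must be strictly positive")
    return [a + e for a, e in zip(alpha, evidence)]


if __name__ == "__main__":
    alpha = [1.0] * 10                       # uniform root prior, alpha_0 = 10
    steps = [[50.0] + [0.5] * 9,             # illustrative mid-expert evidence
             [500.0] + [0.1] * 9]            # illustrative leaf evidence
    print(0, sum(alpha), round(dirichlet_entropy(alpha), 2))
    for depth, e in enumerate(steps, start=1):
        alpha = conjugate_update(alpha, e)
        print(depth, sum(alpha), round(dirichlet_entropy(alpha), 2))
```

Because the uniform Dirichlet is the maximum-entropy density on the simplex, any non-uniform update lowers the differential entropy relative to the root prior, and the values become strongly negative as precision grows.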
##### Accuracy\-Efficiency Pareto Frontier\.
To evaluate the practical efficacy of our dynamic early-exiting strategy on the test set, we sweep the entropy threshold $\eta\in[-66.0,-58.0]$ to generate an empirical Accuracy-Efficiency frontier. Our results show an essentially flat resource-rationality curve (Fig. [5](https://arxiv.org/html/2605.26147#S6.F5)). By relaxing the threshold to $\eta=-58.0$, the model triggers an early exit at Depth 1 for nearly 90% of CIFAR-10 images while maintaining an accuracy of 96.74%. This suggests that the intermediate differential entropy test $\mathcal{H}(\bm{\alpha}_1)<\eta$ (Fig. [1](https://arxiv.org/html/2605.26147#S4.F1)) is a useful proxy for sample difficulty, yielding substantial FLOP savings without measurable performance degradation.
Figure 5: Accuracy-Efficiency Pareto Frontier demonstrating dynamic early-exiting capabilities without performance degradation. The individual data points were generated by sweeping the early-exiting entropy threshold $\eta$ across discrete values from $-66.0$ to $-58.0$. Each point represents an evaluation over the entire CIFAR-10 test set, plotting the percentage of samples that successfully exit at Depth 1 (representing computational savings) against the corresponding overall test accuracy.
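The threshold sweep behind such a frontier can be sketched as follows. The function and argument names are hypothetical, and the four-sample data in the usage note is invented purely to show the mechanics: a sample exits at Depth 1 when $\mathcal{H}(\bm{\alpha}_1)<\eta$, in which case its Depth-1 prediction is scored; otherwise the leaf prediction is scored.

```python
def sweep_early_exit(depth1_entropy, depth1_correct, leaf_correct, thresholds):
    """Return [(eta, exit_fraction, accuracy)] for each threshold eta.

    depth1_entropy: H(alpha_1) per sample.
    depth1_correct: whether the Depth-1 prediction is correct, per sample.
    leaf_correct:   whether the Depth-2 (leaf) prediction is correct, per sample.
    """
    n = len(depth1_entropy)
    frontier = []
    for eta in thresholds:
        exits = [h < eta for h in depth1_entropy]
        hits = sum(d1 if ex else leaf
                   for ex, d1, leaf in zip(exits, depth1_correct, leaf_correct))
        frontier.append((eta, sum(exits) / n, hits / n))
    return frontier
```

For example, `sweep_early_exit([-70, -64, -60, -50], [True, True, False, False], [True, True, True, False], [-66, -62, -58])` traces three points; a flat curve corresponds to the accuracy entry staying constant as the exit fraction rises.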
##### OOD Detection\.
A critical vulnerability of standard deterministic neural networks is their tendency to make highly confident, possibly incorrect, predictions when faced with entirely unfamiliar data. To evaluate whether our framework successfully captures epistemic uncertainty, we assess out-of-distribution (OOD) detection by plotting the terminal entropies for CIFAR-10 (In-Distribution) against the unseen SVHN \[[57](https://arxiv.org/html/2605.26147#bib.bib27)\] test set (OOD). As seen in Fig. [6](https://arxiv.org/html/2605.26147#S6.F6), OOD samples retain significantly higher entropy, which suggests that the model “knows what it does not know” and avoids overconfidence bias.
Figure 6: Terminal Entropy Distribution demonstrating Out-Of-Distribution (OOD) detection. The histograms were generated by passing the full CIFAR-10 test set (blue) and the unseen SVHN test set (orange) through the network to the terminal leaf experts, calculating the final differential entropy $\mathcal{H}(\bm{\alpha}_T)$ for every individual image. The clear rightward shift of the SVHN distribution indicates that the framework assigns higher uncertainty to unfamiliar inputs, allowing for straightforward OOD rejection.
### 6.3 Structured Medical Diagnosis
(In this context, “structured” refers to tabular data characterized by predefined, discrete feature columns (e.g. binary symptom indicators), in direct contrast to the unstructured spatial manifolds (e.g. raw image pixels) evaluated in the previous visual categorization task.)
#### 6.3.1 Experimental Setup and Baselines
We adapt our sequential routing framework to a structured tabular domain using a real-world clinical dataset mapping patient symptom profiles to specific disease endpoints. To maintain consistency with our core methodology (Section [4](https://arxiv.org/html/2605.26147#S4)), the sparse binary patient features $x$ (132 symptoms) are first projected by a shared MLP backbone into a dense, continuous representation to form the Global Knowledge Oracle $\mathbf{h}_x$. To simulate the ambiguity and missingness inherent to real-world Electronic Health Records (EHR), we inject a 5% uniform noise rate across the symptom matrices during training and evaluation.
The routing graph $\mathcal{G}$ is configured to logically mirror a clinical diagnostic triage pathway. The root node delegates the patient to one of four broad physiological expert modules, which subsequently route to highly specific terminal pathologies (spanning 41 diseases). The maximal graph topology is structured as follows:
```
[Level 0: Root, Width: 1]                         Root Triage
                                      /          |          |          \
                                 /               |          |               \
[Level 1: Mid, Width: 4]     Hepatic        Respiratory  Dermatological   Cardio/Neuro
                                |                |             |               |
                                |                |             |               |
[Level 2: Leaf, Width: 4]  Hepatic-Leaf      Resp-Leaf      Derm-Leaf       C/N-Leaf
                         (e.g. Hepatitis)  (e.g. Asthma)  (e.g. Fungal)  (e.g. Migraine)
```
At each traversed node $v_t$, the resident expert network acts as a specialized “diagnostic test”, querying the global patient oracle $\mathbf{h}_x$ to extract local evidence $\mathbf{e}_t$. We benchmark against standard tabular baselines, including a standard dense MLP and eXtreme Gradient Boosted Trees (XGBoost) \[[8](https://arxiv.org/html/2605.26147#bib.bib30)\]. Further experimental details can be found in Appendix [G](https://arxiv.org/html/2605.26147#A7).
#### 6.3.2 Results and Analysis
##### Diagnostic Audit Trails and Belief Shifts\.
The primary advantage of NBSR in healthcare is its native, mathematically rigorous interpretability. Standard deep learning models are generally regarded as black boxes, and ensemble trees such as XGBoost rely heavily on post-hoc approximations (e.g. SHAP \[[52](https://arxiv.org/html/2605.26147#bib.bib29)\] or LIME \[[66](https://arxiv.org/html/2605.26147#bib.bib31)\]) that estimate global feature importance but do not reflect the actual sequential nature of decision-making.
In contrast, NBSR yields a transparent, forward-causal audit trail for every single input. We visualize individual patient trajectories as a directed walk through the DAG. At each juncture $t$, the physician can observe both the deterministic routing decision $a_t$ and the explicit “Belief Shift”, quantified by the Kullback-Leibler divergence $D_{\text{KL}}(\text{Dir}(\bm{\alpha}_{t+1})\parallel\text{Dir}(\bm{\alpha}_t))$ (derivations in Appendix [A](https://arxiv.org/html/2605.26147#A1)). As demonstrated in Fig. [7](https://arxiv.org/html/2605.26147#S6.F7), because the extracted evidence $\mathbf{e}_t$ is strictly additive, this audit trail isolates exactly which expert contributed to the final diagnosis and by how much, offering a new level of clinical accountability.
Figure 7: A sample diagnostic audit trail for Patient 0. The graph explicitly tracks the sequential Belief Shift ($D_{\text{KL}}$). The model routes the patient from the Root prior, aggregates significant evidence at a broad physiological Mid-Expert (e.g. Dermatological), and refines the hypothesis at the Terminal Leaf to reach a confident diagnosis of Fungal Infection.
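The per-step Belief Shift can be computed with the standard closed form for the KL divergence between two Dirichlet distributions, $D_{\text{KL}}(\text{Dir}(\bm{\alpha})\parallel\text{Dir}(\bm{\beta}))=\ln\Gamma(\alpha_0)-\sum_k\ln\Gamma(\alpha_k)-\ln\Gamma(\beta_0)+\sum_k\ln\Gamma(\beta_k)+\sum_k(\alpha_k-\beta_k)\left(\psi(\alpha_k)-\psi(\alpha_0)\right)$. The sketch below is a dependency-free illustration; `digamma` is an approximation introduced here, and `belief_shifts` is a hypothetical helper name.

```python
import math


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (
        1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def dirichlet_kl(alpha_new, alpha_old):
    """KL( Dir(alpha_new) || Dir(alpha_old) ), both with positive entries."""
    a0, b0 = sum(alpha_new), sum(alpha_old)
    log_norm = (math.lgamma(a0) - sum(math.lgamma(a) for a in alpha_new)
                - math.lgamma(b0) + sum(math.lgamma(b) for b in alpha_old))
    psi_a0 = digamma(a0)
    return log_norm + sum((a - b) * (digamma(a) - psi_a0)
                          for a, b in zip(alpha_new, alpha_old))


def belief_shifts(alpha0, evidence_path):
    """Belief shift at each step of a trajectory of additive evidence."""
    shifts, alpha = [], list(alpha0)
    for e in evidence_path:
        updated = [a + x for a, x in zip(alpha, e)]
        shifts.append(dirichlet_kl(updated, alpha))
        alpha = updated
    return shifts
```

Reading the returned list alongside the routing decisions gives the kind of step-by-step trail shown in the figure: a larger value marks the expert whose evidence moved the belief most.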
##### Path\-Dependent Feature Attribution\.
A standard critique of applying deep learning to tabular clinical data is the loss of feature-level interpretability. While baselines such as XGBoost natively provide feature importance via variance reduction, and general neural networks rely on post-hoc approximations (e.g. SHAP \[[52](https://arxiv.org/html/2605.26147#bib.bib29)\]), our NBSR framework provides an exact, analytically differentiable feature attribution mechanism.
Because the terminal Dirichlet concentration for the predicted diagnosis $y^*$ is a strict linear accumulation of the localized evidence vectors extracted along the active trajectory $\mathcal{T}$, defined as $\alpha_{T,y^*}=\alpha_{0,y^*}+\sum_{t=0}^{T-1}e_{t,y^*}$, we can calculate the exact feature importance vector $\mathcal{I}(x)\in\mathbb{R}^{|x|}$ using gradient-based attribution (i.e. input $\times$ gradient). (In gradient-based attribution, the partial derivative quantifies the model’s local sensitivity to a specific feature: a large gradient indicates that even minor perturbations in that feature’s value would induce massive shifts in the predicted evidence. By multiplying this sensitivity by the feature’s actual input magnitude (input $\times$ gradient), we obtain a first-order Taylor approximation of that feature’s total additive contribution to the final diagnostic belief.) By applying the chain rule through the global oracle $\mathbf{h}_x=f_\phi(x)$, the total feature importance decomposes into the specific contributions from each visited expert:
$$\mathcal{I}(x)=x\odot\frac{\partial\alpha_{T,y^*}}{\partial x}=x\odot\sum_{t=0}^{T-1}\left(\frac{\partial e_{t,y^*}}{\partial\mathbf{h}_x}\frac{\partial\mathbf{h}_x}{\partial x}\right)=x\odot\sum_{t=0}^{T-1}\left(\mathbf{J}_{\mathbf{h}_x}(x)^{T}\nabla_{\mathbf{h}_x}e_{t,y^*}\right)\tag{11}$$

where $\odot$ denotes element-wise multiplication, $\nabla_{\mathbf{h}_x}e_{t,y^*}$ is the gradient of the scalar evidence with respect to the oracle features, and $\mathbf{J}_{\mathbf{h}_x}(x)$ is the Jacobian matrix of the oracle representations with respect to the input.
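The decomposition in Eq. 11 can be checked numerically on a toy model. In the sketch below everything is an assumption made for illustration: the oracle is a single `tanh` layer, each expert emits softplus evidence for the predicted class, and gradients are taken by central finite differences rather than automatic differentiation, purely to keep the example dependency-free.

```python
import math


def oracle(x, W):
    """Toy global oracle h_x = tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def evidence(h, v):
    """Strictly positive scalar evidence e = softplus(v . h)."""
    return math.log1p(math.exp(sum(vi * hi for vi, hi in zip(v, h))))


def grad_wrt_input(f, x, eps=1e-5):
    """Central finite-difference gradient of scalar f at x."""
    grads = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


def path_attribution(x, W, experts, alpha0=1.0):
    """Return (per-expert input-x-gradient vectors, total attribution)."""
    per_expert = []
    for v in experts:
        g = grad_wrt_input(lambda z, v=v: evidence(oracle(z, W), v), x)
        per_expert.append([xi * gi for xi, gi in zip(x, g)])

    def alpha_T(z):
        # Terminal concentration: prior plus evidence from every visited expert.
        return alpha0 + sum(evidence(oracle(z, W), v) for v in experts)

    total = [xi * gi for xi, gi in zip(x, grad_wrt_input(alpha_T, x))]
    return per_expert, total
```

Summing the per-expert vectors recovers the total attribution up to numerical error, which is the additivity property the equation expresses; features with $x_i=0$ (absent symptoms) receive zero attribution by construction.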
This represents a fundamental shift from population-level statistics (e.g. XGBoost) to personalized medicine. As shown in Fig. [8](https://arxiv.org/html/2605.26147#S6.F8), XGBoost outputs a global ranking of feature importance accumulated over the entire training set (e.g. highlighting ‘congestion’ or ‘palpitations’). Conversely, NBSR dynamically generates an exact, local attribution specific to the input (i.e. the selected Patient 0), correctly identifying ‘dischromic patches’ and ‘nodal skin eruptions’ as the driving biometric rationales for their specific Fungal Infection diagnosis. This traces exactly which sub-specialist expert in the triage pathway utilized those specific features, yielding a fully transparent, step-by-step biomechanical rationale.
Figure 8: Comparison of Feature Attribution. (Left) XGBoost provides a global, population-level estimate of feature importance via variance reduction, which lacks patient-specific diagnostic relevance. (Right) NBSR utilizes path-dependent gradient attribution to extract the exact, localized biometric markers driving Patient 0’s specific diagnosis (Fungal Infection).
##### Efficacy of Early Exiting and Calibration\.
The dynamic early exiting mechanism in NBSR proves highly effective for routing efficiency. As shown in Table [3](https://arxiv.org/html/2605.26147#S6.T3), NBSR matches the semantic capacity of the Flat MLP baseline, achieving an identical diagnostic accuracy of 97.62%.
Further, the entropy threshold $\eta$ creates an automated, resource-rational triage system. By setting the early exiting threshold to $\eta=-100.0$ (the “Fast” configuration), the model recognizes unambiguous symptom clusters and truncates the inference trajectory early. This reduces the average evaluated graph depth from $2.0$ to $1.0$, resulting in a proportional reduction in hardware inference time without any degradation to final predictive accuracy.
While NBSR successfully saves computational FLOPs, we note a higher ECE compared to the baselines\. Because NBSR’s evidence accumulation is strictly additive \(Theorem\.[1](https://arxiv.org/html/2605.26147#Thmtheorem1)\), the highly deterministic signal\-to\-noise ratio of this specific symptom dataset induces rapid precision scaling, leading the Dirichlet distribution to express hyper\-confident predictions\. In this high\-stakes domain, we purposefully trade a degree of probabilistic softness for exact, sequential interpretability\.
Table 3: Diagnostic performance and efficiency on the clinical Symptom-Disease dataset. Note: The NBSR model was trained once; the Deep and Fast configurations differ only at inference time. Depth indicates the average number of nodes visited before a decision is reached (e.g. a depth of 1.0 implies an early exit at a broad mid-expert, whereas 2.0 indicates reaching a highly specialized terminal leaf). Inference times represent the total execution time across the entire test set.
### 6.4 Language Modelling: Interpretable and Uncertainty-Aware Next-Token Prediction
#### 6.4.1 Experimental Setup and Baselines
Modern Large Language Models (LLMs) generally operate via next-token prediction over a vocabulary space $\mathcal{V}$. However, standard monolithic LMs behave as opaque black boxes and frequently suffer from predictive overconfidence, leading to severe factual hallucinations \[[41](https://arxiv.org/html/2605.26147#bib.bib32),[26](https://arxiv.org/html/2605.26147#bib.bib33),[12](https://arxiv.org/html/2605.26147#bib.bib34)\]. While ad-hoc output mechanisms such as top-$k$ \[[15](https://arxiv.org/html/2605.26147#bib.bib35)\] or nucleus (top-$p$) \[[32](https://arxiv.org/html/2605.26147#bib.bib36)\] sampling are commonly used to truncate the output distribution and inject diversity, they do not quantify the model’s intrinsic epistemic uncertainty. (Top-$k$ sampling restricts selection to the $k$ most probable next tokens, whereas nucleus (top-$p$) sampling dynamically restricts selection to the smallest set of tokens whose cumulative probability exceeds a predefined threshold $p$. Both methods subsequently renormalize the truncated distribution before sampling.)
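As a point of reference for the two truncation heuristics just described, the following sketch shows their mechanics on a plain probability list. The function names are illustrative, and the nucleus cut-off is implemented as "cumulative mass reaches at least $p$", one common convention.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}


def nucleus_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return {i: probs[i] / cumulative for i in keep}
```

Both functions only reshape an existing distribution; neither produces a quantity comparable to the Dirichlet precision, which is the gap the text points to.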
To evaluate NBSR’s reasoning capacity and uncertainty estimation in sequence modeling, we design a contextual next-token prediction task. We utilize a controlled syntactic-semantic disambiguation corpus paired with a word-level tokenizer to maintain strict PoS boundaries. The text tokens are embedded and processed by a shared lightweight causal Transformer backbone to yield the contextual token representation $\mathbf{h}_x$.
Instead of a standard dense projection to the vocabulary logits, $\mathbf{h}_x$ is passed into the NBSR routing graph. The topology is designed to mirror human linguistic processing: first resolving broad part-of-speech (PoS) syntax \[[44](https://arxiv.org/html/2605.26147#bib.bib38),[55](https://arxiv.org/html/2605.26147#bib.bib39)\], then refining via semantic context:
```
[Level 0: Root, Width: 1]                      Lexical Router
                                              /              \
                                             /                \
[Level 1: Mid, Width: 2]          Syntactic Expert          Semantic Expert
                                     /        \                /        \
                                    /          \              /          \
[Level 2: Leaf, Width: 4]  Function Words   Modifiers   Abstract Nouns   Concrete Entities
                           (e.g. the, is)  (e.g. fast)  (e.g. freedom)  (e.g. river, bank)
```
We benchmark against a standard Monolithic Transformer and a standard Transformer MoE (with an uncalibrated, discrete Top-1 routing head) \[[70](https://arxiv.org/html/2605.26147#bib.bib22)\]. All models are evaluated across a held-out test set of 2,000 synthetic sequences. The models are assessed on their semantic learning capacity (Perplexity), distributional reliability (Expected Calibration Error), and raw computational efficiency (Tokens per Second).
#### 6.4.2 Results and Analysis
##### Overall Performance\.
As summarized in Table[4](https://arxiv.org/html/2605.26147#S6.T4), the Deep NBSR model achieves a perplexity of 13\.81\. While this represents a marginal degradation compared to the unconstrained baselines \(13\.43\), it successfully demonstrates that imposing a strict, interpretable hierarchical routing topology largely preserves the core generative capacity of the causal Transformer backbone\.
The baselines achieve a lower empirical ECE \(e\.g\. 0\.011\) than NBSR \(0\.022\)\. However, in highly constrained, low\-vocabulary toy environments, standard ECE computations can artificially favor the sharp, uncalibrated softmax distributions of traditional Transformers\. The primary contribution of NBSR is not to aggressively optimize continuous test\-set ECE, but rather to replace the conceptually flawed softmax output with a Dirichlet distribution that structurally quantifies epistemic uncertainty\.
Further, while the raw CPU inference throughput \(TPS\) is lower for NBSR due to the hardware overhead of PyTorch’s dynamic tensor masking and explicit discrete routing, the architecture proves its theoretical efficiency: the NBSR \(Fast\) configuration successfully halves the required computational graph traversal \(Avg\. Depth 1\.0 vs\. 2\.0\)\. Ultimately, NBSR consciously trades marginal continuous performance overheads for complete architectural transparency and native uncertainty\-awareness\.
Table 4: Language Modeling performance on the contextual disambiguation task. Note: Average Depth reflects dynamic token-level routing. TPS reflects raw CPU inference throughput including the routing mask overhead. The OOD Abstention Rate evaluates epistemic thresholding ($\alpha_0<1.5\times|\mathcal{V}|$); while the miniature toy prior yields 0.0% empirical abstention here, the mechanism provides the mathematical foundation for scalable hallucination prevention.
##### Interpretable Token Routing\.
A major advantage of NBSR in sequence modeling is the transparent resolution of linguistic ambiguity. In standard LMs, polysemous words or complex dependencies are resolved invisibly within dense attention matrices. NBSR, however, physically routes the token prediction through specialized linguistic sub-spaces.
To demonstrate this mechanism, we evaluate the models on an ambiguous but strictly in-distribution prompt context: “the dog is by the…”. The diagnostic output of the baseline models and our NBSR framework is captured in the following trace:
```
Prompt Context: "<bos> the dog is by the ..."
1. Standard Transformer -> Predicts: ’peace’ (Black-box logit max)
2. Transformer MoE -> Predicts: ’truth’ (Discrete routing, no evidence trace)
3. NBSR Language Model -> Predicts: ’war’
[Phase 1] Root Router:
-> Syntactic: 0.00 | Semantic: 1.00
[Phase 2] Total Local Evidence Extracted by Leaves:
-> Function: 0.0 | Modifier: 0.0 | Abstract: 275.5 | Concrete: 320.4
```
To a human reader, all three predictions lack semantic coherence. However, this is expected given the isolated nature of our controlled experiment. Because our synthetic corpus generates text via uniform random sampling across categorical grammar templates (e.g. \[’function’, ’concrete’, ’function’, ’function’, ’abstract’\]), semantic world-knowledge is deliberately absent, and predictions rely purely on learned syntactic structure. The models were trained on exactly 10,000 structurally valid but semantically nonsensical sentences, such as “the car is by the logic” or “a tree was in the fear”. The networks have no real-world understanding of what a “dog” or a “river” is; they solely recognize the patterns of the grammar. Therefore, in this toy universe, predicting an abstract noun like “war” or “peace” is a fully valid syntactic completion.
The critical distinction lies in interpretability. While the standard Transformer outputs a high logit for “peace” with no explanation as to why, NBSR provides a transparent, path-dependent reasoning trace. The audit trail reveals that NBSR’s Root router correctly deduced that a noun must follow the preposition, allocating 100% of its routing probability (1.00) to the Semantic Expert and zeroing out the Syntactic Expert (i.e. probability 0.00). Subsequently, the leaf experts extracted massive diagnostic evidence specifically for Concrete (320.4) and Abstract (275.5) nouns, while explicitly extracting 0.0 evidence for grammatically invalid function words or modifiers.
Even though it ultimately outputs a token that sounds semantically anomalous to a human, the routing trace indicates that NBSR learned and executed the underlying grammatical rules. It did not hallucinate an invalid part-of-speech (like a verb); it predictably narrowed the universe down to a noun, providing a step-by-step rationale that standard LLMs lack. By utilizing this synthetic grammar, we isolated the interpretable routing mechanism, showing that NBSR physically routes tokens through a logical DAG (Part-of-Speech $\to$ Noun Type).
##### Uncertainty-Aware Generation vs. Top-$k$ Selection.
Traditional top-$k$ sampling relies on softmax outputs, which are notoriously poorly calibrated \[[26](https://arxiv.org/html/2605.26147#bib.bib33),[12](https://arxiv.org/html/2605.26147#bib.bib34)\]. Standard neural networks are essentially blind to the boundaries of their own knowledge. Because the softmax function mathematically forces all output probabilities to sum to 100%, the model is stripped of the ability to output “I don’t know”. When presented with an out-of-distribution (OOD) prompt, i.e. a scenario or token that fundamentally differs from the statistical universe the model was trained on, the network pushes the unfamiliar data through its layers and artificially inflates the highest random logit into a confident prediction \[[29](https://arxiv.org/html/2605.26147#bib.bib40),[59](https://arxiv.org/html/2605.26147#bib.bib41)\]. This structural flaw is the mathematical root cause of AI hallucinations: a model confidently guessing on an unfamiliar input.
To evaluate this vulnerability, we feed the models an explicit OOD prompt containing an unknown token (`<unk>`):
```
Ambiguous OOD Prompt: "<bos> the <unk> ..."
1. Standard Transformer -> Predicts: ’or’ with 6.7% confidence.
2. Transformer MoE -> Predicts: ’to’ with 6.6% confidence.
3. NBSR Language Model -> Predicts: ’but’ with 5.7% expected probability.
```
Our toy dataset’s extremely limited vocabulary (65 tokens) naturally prevents the extreme 99% overconfidence spikes seen in massive LLMs, so all three models output relatively flat distributions ($\sim$6%) for the unknown context. Nevertheless, NBSR provides the architectural foundation to address this scaling problem natively: it replaces the forced softmax with a Dirichlet distribution over the vocabulary space, parameterized by $\bm{\alpha}\in\mathbb{R}^{|\mathcal{V}|}$, which accumulates raw evidence rather than strictly forcing probabilities to sum to 1. When NBSR encounters an OOD prompt (like `<unk>`), it searches its specialized routing experts and finds zero matching evidence. Because it extracts no evidence, the total Dirichlet precision $\alpha_0=\sum_i\alpha_i$ stays flat at the uniform prior.
As formalized in Corollary [1](https://arxiv.org/html/2605.26147#Thmcorollary1) (Epistemic Collapse), this failure to accumulate evidence traps the system in a state of maximum epistemic uncertainty ($u\approx 1$). Therefore, NBSR provides a native, mathematically rigorous measure of epistemic uncertainty. This enables Evidential Thresholding as an alternative to top-$k$. Rather than blindly sampling from the top 50 tokens based on uncalibrated logits, the decoding algorithm can directly evaluate the Dirichlet precision $\alpha_0$. When scaled to massive language corpora, if $\alpha_0$ falls below a critical confidence threshold on an OOD prompt, the system recognizes that it lacks evidence. It can then gracefully halt, abstain, or ask the user for clarifying context, which prevents hallucinations at the architectural level rather than relying on post-hoc output filtering.
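A decoding rule of this kind might look like the sketch below. It uses the abstention criterion $\alpha_0<1.5\times|\mathcal{V}|$ quoted in the Table 4 note and the Subjective Logic vacuity $u=K/\alpha_0$; the function names, the three-token vocabulary, and the choice to return `None` on abstention are illustrative assumptions.

```python
def vacuity(alpha):
    """Subjective Logic epistemic uncertainty u = K / alpha_0."""
    return len(alpha) / sum(alpha)


def evidential_decode(alpha, vocab, precision_factor=1.5):
    """Greedy decode with evidential thresholding.

    Returns (token, expected_probability). The token is None (abstain)
    when the total precision alpha_0 is below precision_factor * |V|.
    """
    a0 = sum(alpha)
    best = max(range(len(alpha)), key=lambda i: alpha[i])
    expected = alpha[best] / a0
    if a0 < precision_factor * len(alpha):
        return None, expected
    return vocab[best], expected
```

With a uniform prior `alpha = [1.0, 1.0, 1.0]` the rule abstains and `vacuity` returns 1.0, whereas `alpha = [1.0, 40.0, 1.0]` clears the threshold and yields the second token with expected probability 40/42.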
##### Efficacy of Early Exiting on Syntactic Tokens\.
Human language adheres to Zipf’s law, meaning a vast majority of generated tokens are simple, highly predictable function words (e.g. the, a, is, to). (Zipf’s law is an empirical principle stating that given a large corpus of data, the frequency of an item is inversely proportional to its rank, $f(r)\propto\frac{1}{r}$. Consequently, a small handful of highly ranked items, e.g. function words, dominate the vast majority of occurrences \[[75](https://arxiv.org/html/2605.26147#bib.bib42)\].) Allocating the same amount of computational FLOPs to predict the word “the” as to predict a highly complex domain-specific noun is highly inefficient. NBSR’s dynamic early exiting acts as a resource-rational adaptive compute mechanism. For highly predictable syntactic tokens, the Mid-Level Syntactic Expert extracts massive initial evidence, driving the Dirichlet entropy below the threshold $\eta$ and triggering an early exit at Depth 1. Deep traversal (Depth 2) is automatically reserved only for complex, low-frequency semantic tokens requiring deeper contextual disambiguation. This dynamic depth allocation allows the fast configuration to accelerate token generation while maintaining competitive baseline perplexity.
### 6\.5NBSR\-Mem: NBSR with Dynamic Memory for Control and Planning
Standard deep reinforcement learning and imitation learning policies are notoriously brittle; they operate as opaque black boxes and suffer from catastrophic overconfidence when deployed in out\-of\-distribution \(OOD\) environments\. Further, real\-world autonomous agents rarely possess perfect global information\. When an agent’s observation space is restricted to a local window, the navigation task becomes aPartially Observable Markov Decision Process\(POMDP\)\[[63](https://arxiv.org/html/2605.26147#bib.bib45)\]\. A purelyreactivepolicy is vulnerable tospatial amnesia\- once a crucial environmental cue \(e\.g\. a wall or obstacle\) leaves the local visual frame \(i\.e\. the ’sensory’ or ’receptive’ field\), the agent forgets its existence, leading to failure or infinite navigational loops\. To resolve both the structural opacity of standard neural policies and the limitations of partial observability, we transition fromstaticprediction to a dynamic control task and proposeNBSR\-Mem, which integrates a dynamic recurrent memory buffer into the evidential routing architecture\.
#### 6\.5\.1Task and Experimental Setup\.
The autonomous agent is tasked with navigating a 2D grid environment to reach a randomized goal while avoiding walls and static obstacles\. The environment is strictly partially observable: at any given timesteptt, the agent receives only a5×55\\times 5local spatial matrix centered on its current position \(Fig\.[9](https://arxiv.org/html/2605.26147#S6.F9)\)\. The action space consists of 4 discrete movements: 0 \(Up/Cruising\), 1 \(Down/Reverse\), 2 \(Left/Evasion\), and 3 \(Right/Evasion\)\. A purely reactive policy in this setting is vulnerable to spatial amnesia, i\.e\. once a crucial obstacle or landmark leaves the local visual frame, a reactive agent forgets its existence, which can lead to catastrophic failures or infinite navigational loops\.
To isolate and test the memory mechanism under partial observability, we utilize a procedurally generated sequence of abstracted keyframes representing the crucial phases of a corridor traversal\. A single trajectory is structured as follows \(as illustrated in Fig\.[9](https://arxiv.org/html/2605.26147#S6.F9)\):
- $t=0$ \(Memory Cue\): the agent receives an initial visual cue \(a sign on the left or right wall\) that dictates whether it must turn left or right at a distant intersection\.
- $t=1,2$ \(Spatial Amnesia Zone\): the agent receives a featureless spatial matrix, representing transit through an empty corridor\. A purely reactive agent loses all historical context during this phase\.
- $t=3$ \(Intersection\): the agent arrives abruptly at an intersection with a wall directly ahead\. To successfully navigate, it must recall the cue from $t=0$ and turn in the correct direction\.
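The keyframe sequence above can be sketched as a small generator\. This is a minimal illustration, not the authors’ data pipeline: the cell encoding \(`FLOOR`, `WALL`, `SIGN`\), the position of the sign and wall inside the $5 \times 5$ view, and the choice of “Up” as the expert action for the first three keyframes are all assumptions made here for concreteness\. Only the four\-keyframe structure and the action indices come from the text\.

```python
import random

# Hypothetical cell encoding for the 5x5 local view (not specified in the text).
FLOOR, WALL, SIGN = 0, 1, 2
# Action indices follow the text: 0 Up, 1 Down, 2 Left, 3 Right.
UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3


def empty_view():
    """Return a featureless 5x5 floor matrix."""
    return [[FLOOR] * 5 for _ in range(5)]


def make_trajectory(rng):
    """Build one 4-keyframe episode: cue, two blank frames, intersection."""
    turn = rng.choice([LEFT, RIGHT])

    cue = empty_view()
    # Assumed layout: the sign sits on the left or right edge of the middle row.
    cue[2][0 if turn == LEFT else 4] = SIGN

    intersection = empty_view()
    # Assumed layout: a wall fills the row directly ahead of the agent.
    intersection[1] = [WALL] * 5

    observations = [cue, empty_view(), empty_view(), intersection]
    # Assumed expert labels: cruise upward, then turn as the cue dictated.
    actions = [UP, UP, UP, turn]
    return observations, actions


obs, acts = make_trajectory(random.Random(0))
```

Note that the intersection frame is identical for both turn directions, so the correct final action cannot be inferred from the last observation alone; this is the property that makes the task a memory test\.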
\[Figure 9 diagram: three keyframe panels with agent \(A\), sign \(S\) and wall \(W\) markers \- $t=0$: Memory Cue \(agent sees “Left” sign\); $t=1,2$: Empty Corridor \(Spatial Amnesia Zone\); $t=3$: Intersection \(must recall $t=0$ to turn\)\.\]

Figure 9: Illustration of the synthetic POMDP keyframe sequence\. Rather than a contiguous simulation, the dataset samples crucial keyframes from a corridor traversal\. At $t=0$, an initial cue \(S\) dictates the correct future turning direction\. During $t=1,2$, the agent receives featureless floor matrices, representing a state of visual ambiguity where reactive policies suffer from “spatial amnesia\.” At the final keyframe $t=3$, the agent arrives abruptly at an intersection \(W\) and must rely exclusively on its recurrent memory buffer to execute the evasion maneuver\.

##### Training Paradigm: Behavioral Cloning\.
We employ Behavioral Cloning \(BC\)\[[62](https://arxiv.org/html/2605.26147#bib.bib43)\], a foundational method of Imitation Learning\[[37](https://arxiv.org/html/2605.26147#bib.bib44)\], to train the agent, as illustrated in Fig\.[10](https://arxiv.org/html/2605.26147#S6.F10)\. BC reframes dynamic control as a supervised classification task, where the goal is to learn a policy $\pi_\theta$ that mimics the actions of an algorithmic expert \(which possesses perfect global information\)\. The algorithmic expert generates a dataset of optimal state\-action pairs $(o_t, a^*_t)$\. The neural actor then learns to map state observations directly to the expert’s discrete actions\. During training, the agent policy $\pi_\theta$ receives a batch of observations $o_t$ and outputs a predicted action $\hat{a}_t$\. The network parameters $\theta$ are updated via Backpropagation Through Time \(BPTT\) to minimize the Negative Log\-Likelihood \(NLL\) of the expert’s true action $a^*_t$ under the predicted action distribution\.
\[Figure 10 diagram: the POMDP Environment passes the true state $s_t$ to the Expert Oracle \(perfect information\), which emits the optimal action $a^*_t$; observations $o_t$ and expert actions form the dataset $\mathcal{D} = \{(o_1, a^*_1), \dots, (o_N, a^*_N)\}$; the Agent Policy $\pi_\theta(a \mid o_t)$ maps an observation batch $o_t$ to predictions $\hat{a}_t$; the supervised loss $\mathcal{L}_{NLL}(\hat{a}_t, a^*_t)$ compares predictions with targets and drives the gradient update $\nabla_\theta$\.\]

Figure 10: The Imitation Learning \(Behavioral Cloning\) paradigm under partial observability\. An expert oracle utilizes the true state \($s_t$\) to generate optimal actions \($a^*_t$\), which are paired with the agent’s limited observations \($o_t$\) to form the dataset\. The agent policy $\pi_\theta$ is trained offline using supervised learning to mimic the expert, isolating architectural efficacy from Reinforcement Learning instabilities\.

We deliberately selected this training paradigm over standard Reinforcement Learning \(RL\) for two critical reasons:
- Architecture Isolation: by removing the high variance, reward\-shaping dependencies, and hyperparameter sensitivity inherent to RL, any improvements in performance or safety can be attributed more directly to the NBSR topology\.
- Covariate Shift Testbed: standard BC policies are notoriously susceptible to covariate shift: they mimic the expert blindly and fail catastrophically when encountering novel OOD states\. This makes BC an ideal testbed for demonstrating NBSR’s native evidential abstention mechanisms\.
##### NBSR\-Mem and Baselines\.
To construct NBSR\-Mem, we couple the Global Knowledge Oracle \(a lightweight CNN\) with a Gated Recurrent Unit \(GRU\)\. Importantly, the memory update occurs prior to expert routing\. The CNN features update a persistent hidden state $\mathbf{m}_t$ encoding the historical trajectory\. This temporally\-aware state $\mathbf{m}_t$ is then passed on to the hierarchical evidential DAG \(illustrated below\), where expert networks are intentionally flattened to single linear layers to ensure unhindered gradient flow\.
```
[Level 0: Root, Width: 1] Navigational Router
/ \
/ \
[Level 1: Mid, Width: 2] Cruising Expert Evasion Expert
/ \ / \
/ \ / \
[Level 2: Leaf, Width: 4] Up Down Left Right
(Straight) (Reverse) (Avoid) (Avoid)
```
To validate the necessity of both temporal memory and evidential routing, we benchmark NBSR\-Mem against 3 distinct architectural configurations:
1. Standard CNN \(Reactive\): a standard feed\-forward convolutional network utilizing a Softmax output\. Lacking memory, it is expected to succumb to spatial amnesia; lacking evidential routing, it acts as an opaque black\-box prone to OOD overconfidence\.
2. CNN\-GRU \(Memory\): a temporal baseline where the CNN features update a GRU hidden state before passing to a standard linear classifier\. While capable of solving the POMDP task via memory, it remains structurally opaque and highly vulnerable to OOD hazards\.
3. NBSR \(Reactive\): the hierarchical evidential routing architecture without the GRU memory buffer\. It is expected to fail the primary navigation task due to spatial amnesia, but to successfully trigger epistemic safety halts when encountering OOD traps\.
##### Training Dynamics: Stabilizing the Recurrent Routing
Training this Recurrent Evidential Routing Network is historically challenging\[[70](https://arxiv.org/html/2605.26147#bib.bib22),[60](https://arxiv.org/html/2605.26147#bib.bib47)\]\. Naive implementations typically suffer from expert collapse\[[70](https://arxiv.org/html/2605.26147#bib.bib22)\], vanishing gradients\[[31](https://arxiv.org/html/2605.26147#bib.bib46),[60](https://arxiv.org/html/2605.26147#bib.bib47)\], and temporal memory amnesia \(catastrophic forgetting\)\[[2](https://arxiv.org/html/2605.26147#bib.bib48)\]\. To stabilize NBSR\-Mem, we implement 4 critical structural interventions:
1. Deterministic Routing for BPTT: we replace the stochastic Gumbel\-Softmax with a deterministic, standard Softmax exclusively during training\. This creates a smooth, differentiable pathway, allowing Backpropagation Through Time \(BPTT\) to flow from the routing leaves, through the hierarchy, and deep into the GRU to link past states to future decisions\.
2. Semantic Concept Routing \(Auxiliary Loss\): without explicit guidance, hierarchical routers, particularly when utilizing a deterministic Softmax, seek the laziest optimization, routing all states to a single, monolithic “master expert” \(Expert Collapse\)\. \(Footnote 31: While deterministic Softmax exacerbates mode collapse due to its “winner\-take\-all” gradient scaling, expert collapse is fundamentally an optimization pathology where it is cheaper to update an already\-competent expert than to train a novel one\. This “rich get richer” dynamic occurs across routing mechanisms \(including Sigmoid, Gumbel\-Softmax, and Top\-K\) unless explicit load\-balancing or semantic guidance is applied\.\) To prevent this, we apply a lightweight Auxiliary Concept Loss \($\mathcal{L}_{route}$\) that forces the Root Router to map to human\-interpretable sub\-spaces: straight corridors must be routed to the “Cruising” expert, while complex intersections must trigger “Evasion\.”
3. Masked Evidential Regularization: standard evidential regularizers penalize all generated evidence, creating a “Gradient Tug\-of\-War” where the NLL loss encourages correct evidence, but the regularizer punishes it, resulting in memory failure\. We utilize a target\-masked regularizer\[[69](https://arxiv.org/html/2605.26147#bib.bib28)\] that only penalizes hallucinated evidence on incorrect classes\. This enforces resource rationality, pushing evidence for incorrect actions towards zero while leaving the evidence for the correct class free to grow\. This suggests that the original entropy regularization term $\lambda \sum_{v_t \in \mathcal{T}} \mathcal{H}(\bm{\alpha}_t)$ in Eq\.[7](https://arxiv.org/html/2605.26147#S4.E7) can sometimes introduce optimization deadlock and requires masking for calibration\.
4. Targeted Weight Decay and LayerNorm: to anchor the latent space and bound out\-of\-distribution explosions, we apply LayerNorm and targeted L2 weight decay to the CNN backbone\. This ensures that the weights of untrained visual channels \(e\.g\. alien hazards \- see later\) decay towards zero, preventing random noise from being amplified by the network\.
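The difference between an unmasked and a target\-masked evidence penalty \(intervention 3\) can be illustrated with a toy calculation\. The exact functional form used in the paper \(following\[[69](https://arxiv.org/html/2605.26147#bib.bib28)\]\) is not given in this section, so the sketch below assumes the simplest possible variant, a plain sum of evidence; the function names and the example evidence vector are hypothetical\.

```python
def unmasked_evidence_penalty(evidence):
    """Penalize all evidence, including evidence for the correct class."""
    return sum(evidence)


def masked_evidence_penalty(evidence, target):
    """Penalize only the evidence assigned to classes other than the target."""
    return sum(e for k, e in enumerate(evidence) if k != target)


# Hypothetical evidence over (Up, Down, Left, Right); the expert action is Left.
evidence = [0.0, 0.4, 28.3, 0.1]
target = 2

penalty_all = unmasked_evidence_penalty(evidence)
penalty_masked = masked_evidence_penalty(evidence, target)
```

Under this assumed form, the unmasked penalty grows with the correct\-class evidence and therefore opposes the NLL term, whereas the masked penalty depends only on the two small off\-target entries\.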
#### 6\.5\.2Results and Analysis
##### Overall Navigation Performance\.
We benchmark NBSR and NBSR\-Mem against standard continuous CNN and CNN\-GRU \(Memory\) baselines across 1000 held\-out maze configurations\. The primary metric is the episodic Success Rate \(reaching the goal without crashing\)\.
As shown in Table[5](https://arxiv.org/html/2605.26147#S6.T5), purely reactive policies \(Standard CNN and NBSR Reactive\) fail frequently \($\sim$65% success\) because they suffer from spatial amnesia at complex intersections \($t=3$ in Fig\.[9](https://arxiv.org/html/2605.26147#S6.F9)\)\. In contrast, NBSR\-Mem matches the standard CNN\-GRU baseline, achieving a 100\.0% success rate\. This indicates that the hierarchical DAG structure communicates effectively with the GRU memory buffer, and that the evidential topology incurs no measurable performance penalty on this task\. In this setting, imposing interpretability does not reduce accuracy\.
Table 5: Performance on the Partially Observable 2D Navigation Task\. Note: Success Rate measures the percentage of episodes where the agent safely reached the goal\. Abstention Rate measures the policy’s ability to safely halt when presented with an unseen OOD trap\.
##### Interpretable Policy Routing vs\. Black\-Box Control\.
When the standard CNN\-GRU policy navigates an intersection, it outputs an action with no mechanistic explanation\. In contrast, NBSR\-Mem makes the decision\-making process explicit\. By examining the audit trace of a successful evasive maneuver, we observe the model explicitly routing through semantic sub\-spaces:
```
Agent State: Approaching a wall directly ahead
(Remembering Left Sign from t=0).
-> Standard CNN-GRU Policy: Predicts Action [Turn Left] (Opaque mechanism)
-> NBSR-Mem Policy Trace:
[Phase 1] Root Router: Cruising (0.00) | Evasion (1.00)
[Phase 2] Depth 1 Evidence (Evasion Expert):
-> Up: 0.0 | Down: 0.0 | Left: 28.3 | Right: 0.0
-> (Entropy remains above threshold, routing deeper...)
[Phase 3] Depth 2 Additive Evidence (Leaf Experts):
-> Up: 0.0 | Down: 0.0 | Left: 75.0 | Right: 0.0
```
Thanks to the Auxiliary Concept Loss, Phase 1 cleanly isolates the abstract mode \(Evasion: 1\.00\)\. Further, because the model is resource\-rational, the evidence is sharp and sparse: it outputs 0\.0 for every incorrect action in this trace, indicating that the evidential regularizer suppressed hallucinated evidence\.
##### Resource\-Rational Adaptiveness and Computational Efficiency\.
Notably, the integration of temporal memory profoundly impacts dynamic routing depth\. The reactive NBSR policy, lacking historical context at intersections, exhibits high epistemic uncertainty\. Because a perfectly uncertain 4\-class Dirichlet distribution \($\bm{\alpha} = [1,1,1,1]$\) yields a maximum differential entropy of roughly $-1.79$, the reactive policy’s entropy remains well above the $-4.5$ confidence threshold we set, forcing the network to route to Depth 2 \(Avg\. Depth: 1\.18\)\. \(Footnote 32: The average depth is calculated across the entire test trajectory dataset\. Because the reactive policy lacks memory, it successfully early\-exits on the trivial empty corridors \(comprising the majority of states\), but is forced to traverse to Depth 2 upon reaching every intersection due to high epistemic uncertainty\. This blend of Depth 1 and Depth 2 routing yields an overall episodic average of 1\.18\.\)
In contrast, NBSR\-Mem utilizes its GRU hidden state to achieve high confidence\. As seen in the trace above, the Depth 1 Evasion Expert alone extracts 28\.3 units of evidence for the correct turn based on memory\. This large reduction in uncertainty drives the entropy below the threshold, allowing the router to safely early\-exit\. The model achieves sufficient epistemic confidence to early\-exit on all states, yielding an average inference depth of exactly 1\.00 while maintaining 100% accuracy\. Thus, our architecture demonstrates that resolving partial observability not only recovers task performance but also reduces the computational footprint of the hierarchical policy\.
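The entropy figures in this discussion can be checked with the standard closed form for the differential entropy of a Dirichlet distribution\. The sketch below is an illustration under stated assumptions, not the authors’ code: it assumes a uniform prior of ones and that the Depth 1 evidence from the trace is added directly to that prior\. The `digamma` helper is a hand\-rolled approximation \(recurrence plus asymptotic series\) used only to keep the example free of third\-party dependencies\.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0."""
    result = 0.0
    while x < 6.0:  # shift x upward with psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def dirichlet_entropy(alpha):
    """Differential entropy of Dir(alpha)."""
    a0 = sum(alpha)
    k = len(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(a0)
    return (log_beta + (a0 - k) * digamma(a0)
            - sum((a - 1.0) * digamma(a) for a in alpha))


THRESHOLD = -4.5  # confidence threshold quoted in the text

prior = [1.0, 1.0, 1.0, 1.0]      # assumed uniform prior
evidence = [0.0, 0.0, 28.3, 0.0]  # Depth 1 evidence from the trace
posterior = [a + e for a, e in zip(prior, evidence)]

h_prior = dirichlet_entropy(prior)      # equals -log(6), about -1.79
h_post = dirichlet_entropy(posterior)
early_exit = h_post < THRESHOLD
```

Under these assumptions the prior entropy sits above the $-4.5$ threshold, while the posterior entropy after adding 28\.3 units of evidence to a single class falls below it, which is the early\-exit condition described in the text\.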
##### Epistemic Safety in Out\-of\-Distribution Environments\.
The most critical failure mode of standard autonomous policies is their behavior in novel environments\. To test this, we inject an “Alien Hazard” \(a visual channel never present in training\) directly into the agent’s path\.
Because standard baselines utilize a softmax activation, the normalization constraint \($\sum p = 1$\) forces the policy to confidently hallucinate\. In 100% of our OOD trials, the standard CNN\-GRU policy selected a movement action and crashed into the hazard \(Table[5](https://arxiv.org/html/2605.26147#S6.T5)\)\.
NBSR\-Mem addresses this natively through Dirichlet thresholding, empirically supporting Corollary[1](https://arxiv.org/html/2605.26147#Thmcorollary1) \(Epistemic Collapse\)\. Due to our LayerNorm and targeted weight decay, the untrained Alien channel weights were anchored near zero\. When the Alien Trap appeared, the network extracted only 8\.39 total precision, below the operating threshold of 10\.0 that we set\. Because the evidence remained stagnant, the epistemic uncertainty $u$ did not collapse\. Instead of hallucinating, NBSR\-Mem recognized its own ignorance and triggered a safe fallback protocol \(100\.0% Halt\)\. Without these target\-masked regularizations, ID and OOD evidence distributions overlap dangerously; with them, NBSR\-Mem preserves the epistemic safety margin in our trials\. Overall, this demonstrates that NBSR\-Mem “knows when it doesn’t know”, providing a robust, uncertainty\-aware framework for safe autonomous control\.
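The halting rule described above reduces to a comparison between the total Dirichlet precision and a fixed threshold\. The following sketch shows that comparison; the function names and the two example belief vectors are hypothetical \(only the totals 8\.39 and the threshold 10\.0 come from the text\), and the definition $u = K / S$ is the one commonly used in evidential deep learning, assumed here rather than taken from this section\.

```python
from typing import List, Optional


def total_precision(alpha: List[float]) -> float:
    """Dirichlet precision S: the sum of all concentration parameters."""
    return sum(alpha)


def vacuity(alpha: List[float]) -> float:
    """Assumed uncertainty mass u = K / S (common evidential-learning definition)."""
    return len(alpha) / total_precision(alpha)


def select_action(alpha: List[float], min_precision: float = 10.0) -> Optional[int]:
    """Return the most supported action, or None to signal a safety halt."""
    if total_precision(alpha) < min_precision:
        return None
    return max(range(len(alpha)), key=lambda k: alpha[k])


# Hypothetical belief states: the OOD one is split so that it sums to 8.39.
ood_alpha = [2.1, 2.0, 2.2, 2.09]
in_dist_alpha = [1.0, 1.0, 29.3, 1.0]

ood_decision = select_action(ood_alpha)          # None: halt
in_dist_decision = select_action(in_dist_alpha)  # 2: turn left
```

A softmax classifier has no analogue of the `None` branch, since its outputs are normalized regardless of how little evidence supports them\.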
### 6\.6NBSR as Active Learning in Bayesian Optimal Experimental Design
In many scientific and clinical domains, acquiring data is inherently constrained by stringent budgets\. For example, in pharmacological dose\-response modeling, a practitioner must actively decide which drug concentrations to measure in order to maximize understanding of the efficacy curve \(e\.g\. identifying the $EC_{50}$\) while minimizing laboratory costs\. This challenge is formalized by Bayesian Optimal Experimental Design \(BOED\)\[[68](https://arxiv.org/html/2605.26147#bib.bib51)\], which seeks to select an experimental design $\xi$ that maximizes the Expected Information Gain \(EIG\)\.

\(Footnote 33: The $EC_{50}$ \(half maximal effective concentration\) represents the concentration of a drug or intervention that induces a response halfway between the baseline and the maximum possible effect, serving as a standard measure of potency\[[34](https://arxiv.org/html/2605.26147#bib.bib49)\]\.\)

\(Footnote 34: The unobserved target variable can be a particular parameter of interest, such as the latent $EC_{50}$ threshold in pharmacodynamics, or a broader categorical outcome, such as the true underlying pathology in a clinical diagnostic setting\. Note that to maintain notational consistency with the NBSR framework, we use $y$ to denote the unobserved target variable \(commonly denoted as $\theta$ in standard BOED literature e\.g\. in\[[18](https://arxiv.org/html/2605.26147#bib.bib50)\]\) and $e$ to denote the prospective experimental outcome or evidence \(commonly denoted as $y$\)\.\)

For an unobserved target variable $y$ and a prospective measurement $e$ acquired under design $\xi$, the EIG is mathematically defined as the mutual information between $y$ and $e$\[[18](https://arxiv.org/html/2605.26147#bib.bib50)\]:
$$\text{EIG}(\xi) = I(y; e \mid \xi) = \mathbb{E}_{p(e \mid \xi)}\big[\mathcal{H}(p(y)) - \mathcal{H}(p(y \mid e, \xi))\big] \tag{12}$$

where $\mathcal{H}$ denotes the entropy of the belief state\. In other words, BOED identifies the specific experimental design that, in expectation over all possible measurement outcomes, induces the maximum reduction in posterior entropy over the target latent variables\[[68](https://arxiv.org/html/2605.26147#bib.bib51),[18](https://arxiv.org/html/2605.26147#bib.bib50)\]\. Provided the underlying predictive model is accurate, this approach constitutes an optimal, albeit myopic \(one\-step\), data acquisition strategy from a strict information\-theoretic perspective\.
The NBSR framework natively operates as a sequential BOED engine\. At any inference step $t$, the routing action $a_t$ formally corresponds to selecting the next experimental design $\xi$ \(e\.g\. querying a specific sub\-specialist or running a distinct medical test\), and the extracted evidence $\mathbf{e}_t$ serves as the observation\. Historically, maximizing EIG in deep neural networks is computationally intractable because the posterior entropy requires evaluating an intractable marginal likelihood, often necessitating complex variational lower bounds\[[18](https://arxiv.org/html/2605.26147#bib.bib50)\]\. However, because the NBSR Bayesian state update is defined via exact conjugate addition \($\bm{\alpha}_{t+1} = \bm{\alpha}_t + \mathbf{e}_t$\), our framework bypasses this intractability entirely\. The exact, realized information gain achieved by querying expert $v_{t+1}$ is analytically computed as the drop in differential entropy: $\Delta\mathcal{H}_t = \mathcal{H}(\bm{\alpha}_t) - \mathcal{H}(\bm{\alpha}_{t+1})$\.
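For reference, the reason this entropy drop is available in closed form is the standard expression for the differential entropy of a Dirichlet distribution, reproduced here as a well\-known identity rather than as a result of this paper\. The symbols $K$ \(number of classes\), $\alpha_0$ \(total precision\), $B$ \(multivariate Beta function\) and $\psi$ \(digamma function\) are introduced here for the illustration\.

```latex
\mathcal{H}(\bm{\alpha})
  = \log B(\bm{\alpha})
  + (\alpha_0 - K)\,\psi(\alpha_0)
  - \sum_{k=1}^{K} (\alpha_k - 1)\,\psi(\alpha_k),
\qquad
\alpha_0 = \sum_{k=1}^{K} \alpha_k,
\qquad
\log B(\bm{\alpha}) = \sum_{k=1}^{K} \log\Gamma(\alpha_k) - \log\Gamma(\alpha_0).

% Worked check for the uniform case \bm{\alpha} = \mathbf{1}_K with K = 4:
% both digamma terms vanish, so
\mathcal{H}(\mathbf{1}_4) = -\log\Gamma(4) = -\log 6 \approx -1.79 .
```

The worked check agrees with the value of roughly $-1.79$ quoted for the perfectly uncertain 4\-class belief state in the navigation experiment\.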
As established in Proposition[1](https://arxiv.org/html/2605.26147#Thmproposition1), the structural penalty $\lambda \sum \mathcal{H}(\bm{\alpha}_t)$ in our training objective intrinsically forces the router to maximize this sequential EIG by penalizing delayed uncertainty reduction\. By introducing explicit, asymmetric measurement costs into the graph, we can extend NBSR into a fully resource\-rational Active Learning framework\.
#### 6\.6\.1Task and Experimental Setup: Active Clinical Triage
To empirically validate NBSR as an active BOED agent, we extend the Structured Medical Diagnosis task \(Section[6\.3](https://arxiv.org/html/2605.26147#S6.SS3)\) into an Active Clinical Triage environment\. In standard machine learning, the model is given free, simultaneous access to all input features\. In this experiment, the patient’s true diagnostic state is initially hidden\. The agent only has access to free, baseline demographic data at $t=0$\.
The 132 symptoms are clustered into 5 distinct “Diagnostic Test Panels” \(e\.g\. Basic Metabolic Panel, Neurological Exam, Dermatological Swab\)\. Each panel is governed by a dedicated NBSR Expert node and has an associated financial or temporal cost $c(v)$, detailed in Table[6](https://arxiv.org/html/2605.26147#S6.T6)\.
Table 6: Diagnostic Test Panels and Associated Costs \($c(v)$\)\. These values represent the asymmetric financial or temporal constraints required to execute each specific diagnostic panel\.

##### The Active BOED Objective\.
The agent must sequentially select which tests to run to confidently diagnose the patient, balancing diagnostic accuracy against the cumulative cost of the required tests\. We modify the primary NBSR training objective \(Eq\.[7](https://arxiv.org/html/2605.26147#S4.E7)\) to incorporate a cost\-aware regularizer:
$$\mathcal{L}_{\text{BOED}}(x, y^*) = -\log \mathbb{E}_{\mathbf{p} \sim \text{Dir}(\bm{\alpha}_T)}[p_{y^*}] + \lambda \sum_{v_t \in \mathcal{T}} \mathcal{H}(\bm{\alpha}_t) + \gamma \sum_{v_t \in \mathcal{T}} c(v_t) \tag{13}$$

where the expectation is taken over the terminal Dirichlet belief state $\mathbf{p} \sim \text{Dir}(\bm{\alpha}_T)$, and $\gamma$ is a tunable hyperparameter dictating the budgetary strictness\. Notably, as derived previously in Eq\.[cc\.Eq\.17](https://arxiv.org/html/2605.26147#S3.Ex3), the expected marginal probability under the Dirichlet distribution expands exactly to:

$$\mathbb{E}_{\mathbf{p} \sim \text{Dir}(\bm{\alpha}_T)}[p_{y^*}] = \frac{\alpha_{T,y^*}}{\sum_{k=1}^{K} \alpha_{T,k}}$$

which establishes that the first two terms of Eq\.[13](https://arxiv.org/html/2605.26147#S6.E13) are mathematically identical to the primary NBSR training objective \(Eq\.[7](https://arxiv.org/html/2605.26147#S4.E7)\)\. During the forward pass, the Router processes the patient’s current Dirichlet belief state $\bm{\alpha}_t$ to select the test panel\. The policy is implicitly optimized during training to select tests that maximize information gain per unit cost, as governed by the cost weight $\gamma$ and the entropy penalty weight $\lambda$ in Eq\.[13](https://arxiv.org/html/2605.26147#S6.E13)\. The process halts dynamically when the belief entropy $\mathcal{H}(\bm{\alpha}_t)$ falls below the confidence threshold $\eta$\.
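As a minimal sketch of how the three terms of the cost\-aware objective combine for a single trajectory, the snippet below evaluates the NLL of the terminal Dirichlet mean, the summed entropies and the summed panel costs\. It is an illustration only: the function name `boed_loss`, the example belief states, the costs and the values of $\lambda$ and $\gamma$ are hypothetical, the entropy sum is taken over every belief state passed in, and the `digamma` helper is a hand\-rolled approximation used to avoid third\-party dependencies\.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def dirichlet_entropy(alpha):
    """Differential entropy of Dir(alpha)."""
    a0, k = sum(alpha), len(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(a0)
    return (log_beta + (a0 - k) * digamma(a0)
            - sum((a - 1.0) * digamma(a) for a in alpha))


def boed_loss(alphas, target, costs, lam, gamma):
    """NLL of the terminal Dirichlet mean + entropy penalty + cost penalty.

    alphas: belief states along the trajectory; the last one is alpha_T.
    costs:  c(v_t) for each panel queried along the trajectory.
    """
    alpha_T = alphas[-1]
    nll = -math.log(alpha_T[target] / sum(alpha_T))
    entropy_term = lam * sum(dirichlet_entropy(a) for a in alphas)
    cost_term = gamma * sum(costs)
    return nll + entropy_term + cost_term


# Hypothetical 3-class trajectory with two queried panels.
alphas = [[1.0, 1.0, 1.0], [1.0, 4.0, 1.0], [1.0, 9.0, 2.0]]
costs = [10.0, 25.0]
loss = boed_loss(alphas, target=1, costs=costs, lam=0.01, gamma=0.001)
```

Setting `lam` and `gamma` to zero recovers the plain NLL term $-\log(\alpha_{T,y^*} / \sum_k \alpha_{T,k})$, which makes the role of each weight easy to inspect in isolation\.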
##### Connection to the Standard EIG Objective\.
To understand why optimizing Eq\.[13](https://arxiv.org/html/2605.26147#S6.E13) effectively solves the active learning problem, we compare it directly to the standard BOED objective\. Recall that standard BOED seeks a design $\xi$ that maximizes the Expected Information Gain \(Eq\.[12](https://arxiv.org/html/2605.26147#S6.E12)\):

$$\text{EIG}(\xi) = I(y; e \mid \xi) = \mathbb{E}_{p(e \mid \xi)}\big[\mathcal{H}(p(y)) - \mathcal{H}(p(y \mid e, \xi))\big]$$

In the NBSR framework, at step $t$, the “prior” state is given by the Dirichlet distribution parameterized by $\bm{\alpha}_t$\. The routing decision $a_t$ serves as the experimental design $\xi$, and the extracted evidence $\mathbf{e}_t$ serves as the observation $e$\. The exact, realized information gain of taking step $t$ is the reduction in entropy: $\Delta\mathcal{H}_t = \mathcal{H}(\bm{\alpha}_t) - \mathcal{H}(\bm{\alpha}_{t+1})$\.
Minimizing the cumulative sum of entropies $\lambda \sum_{t=0}^{T} \mathcal{H}(\bm{\alpha}_t)$ in Eq\.[13](https://arxiv.org/html/2605.26147#S6.E13) forces the network to minimize the area under the entropy curve\. Algebraically, this heavily penalizes delayed uncertainty reduction, forcing the routing policy $\pi_\theta$ to consistently select paths that maximize the expected stepwise drop $\mathbb{E}[\Delta\mathcal{H}_t]$ at every juncture\. Therefore, our regularized loss $\mathcal{L}_{\text{BOED}}$ acts as a computationally tractable, end\-to\-end surrogate for maximizing the exact sequential EIG \(Eq\.[12](https://arxiv.org/html/2605.26147#S6.E12)\)\. This circumvents the intractable marginal likelihood integrations typically required in standard variational BOED\[[18](https://arxiv.org/html/2605.26147#bib.bib50)\]\.
##### Autoregressive Routing \(AR\-NBSR\) to Prevent Combinatorial Explosion\.
Deploying a static, feed\-forward DAG topology into this active triage environment would induce a massive combinatorial explosion\. Specifically, given $N$ available diagnostic test panels \(the branching factor\) and a maximum required testing sequence of $L$ steps \(the tree depth\), a hardcoded decision tree would require an intractable $\mathcal{O}(N^L)$ leaf\-node complexity\. To resolve this, we configure the framework as an Autoregressive State Machine \(ASM\)\.
We formally define the AR\-NBSR as an ASM: a sequential generative system where the current state $\mathcal{S}_t$ and the subsequent transition are conditioned on the history of prior observations\. We define the ASM by the tuple $(\mathcal{S}, \mathcal{A}, \pi_\theta, \mathcal{U})$, where $\mathcal{S}_t = (\mathbf{h}_x, \bm{\alpha}_t, \mathbf{m}_t)$ is the joint embedding of oracle features, current Dirichlet belief state, and action history; $\mathcal{A}$ is the action space of diagnostic panels; $\pi_\theta(a_t \mid \mathcal{S}_t)$ is the routing policy; and $\mathcal{U}$ is the state\-update function defined by the Bayesian conjugate addition $\bm{\alpha}_{t+1} = \bm{\alpha}_t + \mathbf{e}_t$\. While the state evolution inherently satisfies the Markov property \(as the posterior belief $\bm{\alpha}_{t+1}$ is conditioned solely on the previous state $\bm{\alpha}_t$ and the instantaneous evidence $\mathbf{e}_t$\), we term this process autoregressive because the model uses its own past evidential accumulations to regress towards a terminal hypothesis\.
Conceptually, this maps directly to the cyclical architecture illustrated in Fig\.[11](https://arxiv.org/html/2605.26147#S6.F11): instead of routing a patient down a physical hallway of isolated sub\-specialists \(a deep DAG\), a single reusable Global Triage Router and a flat array of independent Expert Networks operate in a recurrent, time\-indexed loop\. Consequently, static architectural depth is replaced by recurrent time\. The diagnostic trajectory unfolds dynamically as an iterative loop, for example:
1. $t=0$: The Global Router evaluates the initial baseline prior $\bm{\alpha}_0$ alongside the demographic features, and selects the first test \(e\.g\. Panel 2\)\. Panel 2 extracts evidence $\mathbf{e}_0$, updating the belief to $\bm{\alpha}_1 = \bm{\alpha}_0 + \mathbf{e}_0$\.
2. $t=1$: The exact same Global Router evaluates the newly updated state $\bm{\alpha}_1$\. Recognizing lingering uncertainty regarding a specific pathology, it selects a complementary test \(e\.g\. Panel 4\)\. Panel 4 extracts evidence $\mathbf{e}_1$, updating the belief to $\bm{\alpha}_2 = \bm{\alpha}_1 + \mathbf{e}_1$\.
3. $t=2$: The router evaluates $\bm{\alpha}_2$\. The newly accumulated evidence causes the differential entropy $\mathcal{H}(\bm{\alpha}_2)$ to fall below the confidence threshold $\eta$\. The recurrent loop terminates dynamically, yielding the final diagnosis without consuming further testing budget\.
This autoregressive formulation bounds the parameter count to a single router and $N$ experts, while granting the agent a dynamic, virtually infinite sequential action space\.
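The recurrent loop described above can be sketched in a few lines\. This is a toy illustration under stated assumptions, not the trained AR\-NBSR model: the experts return fixed, hypothetical evidence vectors, the `router` follows a fixed preference order in place of the learned policy $\pi_\theta$, and the costs, threshold and `digamma` approximation are all chosen here for the example\.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def dirichlet_entropy(alpha):
    """Differential entropy of Dir(alpha)."""
    a0, k = sum(alpha), len(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(a0)
    return (log_beta + (a0 - k) * digamma(a0)
            - sum((a - 1.0) * digamma(a) for a in alpha))


def run_triage(alpha0, router, experts, costs, eta, max_steps):
    """Check entropy, pick a panel, add its evidence, and repeat."""
    alpha = list(alpha0)
    used, total_cost = [], 0.0
    for _ in range(max_steps):
        if dirichlet_entropy(alpha) < eta:
            break  # confident enough: stop ordering tests
        panel = router(alpha, used)
        evidence = experts[panel](alpha)
        alpha = [a + e for a, e in zip(alpha, evidence)]  # conjugate addition
        used.append(panel)
        total_cost += costs[panel]
    return alpha, used, total_cost


# Hypothetical stand-ins for the expert networks and their costs.
experts = {
    0: lambda alpha: [0.5, 0.5, 0.5],
    1: lambda alpha: [6.0, 0.1, 0.1],
    2: lambda alpha: [8.0, 0.1, 0.1],
}
costs = {0: 5.0, 1: 20.0, 2: 40.0}


def router(alpha, used):
    """Pick the first unused panel in a fixed order (a learned policy in the paper)."""
    return next(p for p in (1, 2, 0) if p not in used)


alpha_T, used, total_cost = run_triage(
    [1.0, 1.0, 1.0], router, experts, costs, eta=-3.0, max_steps=3
)
```

The same `router` and the same flat dictionary of experts are reused at every step, so the number of components stays at one router plus $N$ experts regardless of how many steps the loop takes\.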
\[Figure 11 diagram: the Global Oracle supplies features $\mathbf{h}_x$ to the Global Router $\pi_\theta(a_t \mid \mathcal{S}_t)$, whose action $a_t$ activates one of Panel 1, …, Panel $N$; the active expert’s evidence $\mathbf{e}_t$ is added \($+$\) to the prior $\bm{\alpha}_t$ to give the posterior $\bm{\alpha}_{t+1}$, which the Belief Memory \($z^{-1}$\) feeds back to the router; the check $\mathcal{H}(\bm{\alpha}_t) < \eta$ either yields the Final Diagnosis \(Yes\) or continues the loop \(No\)\.\]

Figure 11: Illustration of the Autoregressive NBSR \(AR\-NBSR\) framework\. Unlike a static feed\-forward DAG, the Global Triage Router iteratively queries a flat array of available diagnostic panels\. The belief state $\bm{\alpha}_t$ acts as a recurrent memory buffer, denoted by the Unit Delay Operator \($z^{-1}$\) from discrete\-time signal processing, which caches the newly updated posterior to serve as the prior for the subsequent routing step\. By sequentially accumulating evidence $\mathbf{e}_t$ via conjugate addition until the diagnostic entropy falls below the safety threshold $\eta$, this temporal feedback loop naturally averts the $\mathcal{O}(N^L)$ combinatorial explosion\.

Clinical Metaphor: The Active Triage Room

To intuitively map the mathematical terminology of AR\-NBSR to a real\-world clinical setting, consider a hospital triage room:

- The Global Router \(The Lead Diagnostician\): Rather than passing the patient down a physical hallway of isolated sub\-specialists \(a deep DAG\), a single Lead Diagnostician sits at a central desk\. Their only job is to evaluate the current information and decide which test to order next\.
- The Expert Networks \(The Test Panels\): Down the hall are 5 distinct testing rooms \(e\.g\. Blood Panel, MRI, Respiratory Swab\)\. These are the expert modules\.
- The Dirichlet Belief State $\bm{\alpha}_t$ \(The Clipboard\): The Diagnostician holds a clipboard tracking the probability of 41 possible diseases\. It starts blank \(maximum entropy\)\.
- Autoregressive Time \(The Loop\): The Diagnostician sends the patient to Room 2\. The result \(evidence $\mathbf{e}_0$\) comes back, and the clipboard is updated \($\bm{\alpha}_1$\)\. Realizing more information is needed to distinguish between two highly probable diseases, the Diagnostician sends the patient to Room 4 \($\mathbf{e}_1 \to \bm{\alpha}_2$\)\.
- The Confidence Threshold $\eta$ \(The Stop Condition\): Once the clipboard indicates sufficient certainty for a specific disease, the Diagnostician halts the testing loop, sparing the patient the financial cost and time of visiting the remaining 3 rooms\.
##### Baselines\.
We benchmark the NBSR active learning agent against two standard sequential acquisition strategies:
1. Random Allocation: Iteratively queries random test panels until the entropy threshold $\eta$ is reached\.
2. Greedy EIG \(Myopic BOED\): A standard active learning baseline that evaluates the expected entropy drop for all available tests at step $t$, and selects the test that maximizes the immediate gain\-to\-cost ratio $\frac{\Delta\mathcal{H}_t}{c(v)}$\. While optimal for a single isolated step, myopic BOED is well\-known to lack long\-horizon planning\.
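The greedy gain\-to\-cost rule of the second baseline can be sketched as follows\. This is an assumption\-laden illustration rather than the baseline’s actual implementation: instead of an expectation over outcomes, it uses a single hypothetical predicted evidence vector per panel, and the panel names, evidence values and costs are invented for the example\. The `digamma` helper is a hand\-rolled approximation used to keep the snippet dependency\-free\.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def dirichlet_entropy(alpha):
    """Differential entropy of Dir(alpha)."""
    a0, k = sum(alpha), len(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(a0)
    return (log_beta + (a0 - k) * digamma(a0)
            - sum((a - 1.0) * digamma(a) for a in alpha))


def greedy_pick(alpha, predicted_evidence, costs, used):
    """Select the unused panel with the largest entropy drop per unit cost."""
    h_now = dirichlet_entropy(alpha)
    best, best_ratio = None, -math.inf
    for panel, evidence in predicted_evidence.items():
        if panel in used:
            continue
        h_next = dirichlet_entropy([a + e for a, e in zip(alpha, evidence)])
        ratio = (h_now - h_next) / costs[panel]
        if ratio > best_ratio:
            best, best_ratio = panel, ratio
    return best


# Hypothetical panels: a cheap weak test and an expensive decisive one.
predicted_evidence = {
    "cheap": [1.0, 0.2, 0.2],
    "expensive": [12.0, 0.2, 0.2],
}
costs = {"cheap": 2.0, "expensive": 100.0}

first_choice = greedy_pick([1.0, 1.0, 1.0], predicted_evidence, costs, used=[])
```

With these invented numbers the rule picks the cheap panel first even though the expensive one removes far more uncertainty, which illustrates the kind of myopic behavior attributed to this baseline\.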
#### 6\.6\.2Results and Analysis
##### Training Dynamics and Convergence\.
The training stability of the AR\-NBSR model is evidenced by the convergence curves in Fig\.[12](https://arxiv.org/html/2605.26147#S6.F12)\. We observe a precipitous decline in NLL during the initial 40 epochs, indicating that the expert networks successfully internalized the diagnostic features of the noisy clinical dataset\. Concurrently, the Average Patient Cost descends from the maximum budgetary ceiling \(\$125\) to a stable operational regime \($\sim$\$48\)\. This indicates that the AR\-NBSR agent reaches a stable, resource\-rational equilibrium, balancing predictive accuracy against the cost\-awareness regularizer without exhibiting signs of training instability or divergence\.
Figure 12:AR\-NBSR Training Convergence\. The simultaneous decline in NLL \(Blue\) and Average Patient Cost \(Red\) indicates the agent effectively learns to balance diagnostic precision with budgetary constraints\.
##### Cost\-Accuracy Pareto Frontier\.
To demonstrate the model’s resource\-rationality, we evaluate the learned routing policy across a range of operational confidence thresholdsη\\eta\. By sweeping the early\-exiting thresholdη\\etaduring inference \(effectively adjusting the agent’s internal confidence requirement\), we generate a Pareto frontier plotting diagnostic accuracy against the average cumulative cost per patient \(Fig\.[13](https://arxiv.org/html/2605.26147#S6.F13)\)\. Note that while the budgetary penaltyγ\\gammais used duringtrainingto encourage cost\-aware feature acquisition, the Pareto frontier specifically maps the trade\-off between the agent’stest\-timeconfidence \(governed byη\\eta\) and the resulting economic and predictive performance\. As expected, Random Allocation requires nearly all tests \(maximum cost\) to reach baseline accuracy\. The Greedy EIG baseline performs well initially but frequently selects cheap, uninformative tests that require subsequent expensive follow\-ups to resolve lingering uncertainty\.
AR\-NBSR outperforms both baselines, behaving as a non\-myopic, resource\-rational, active triage agent\. Because the routers are trained end\-to\-end via the Gumbel\-Softmax STE, the routing policy moves beyond myopic step\-wise gains to learn a global diagnostic trajectory\. By generating a high\-resolution frontier using 30 linearly spaced thresholds, we observe a distinct “elbow” in the AR\-NBSR curve\. It achieves its peak diagnostic accuracy \(approximately 90\.5%\) at an average cost of approximately \$45 to \$50, whereas the Greedy baseline requires upwards of \$60 to \$80 to reach the same predictive asymptote\. This suggests that occasionally selecting a moderately expensive test early in the sequence can be the better choice when it substantially narrows the downstream diagnostic search space\.
Figure 13: Cost\-Accuracy Pareto Frontier\. AR\-NBSR \(Blue\) demonstrates superior resource\-rationality compared to Greedy \(Orange\) and Random \(Grey\) baselines, hitting the peak accuracy asymptote at a significantly lower average cumulative cost\.
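The threshold sweep behind such a frontier can be sketched in a few lines\. The following is a minimal illustration rather than the authors' evaluation code: the toy trajectories, their entropy values, costs and labels are all hypothetical, and the exit rule \(stop once the belief entropy is at or below $\eta$\) is an assumption about how the threshold is applied\.

```python
from typing import List, Tuple

# Hypothetical toy data (illustrative values only, not from the paper).
# Each step is (belief_entropy_after_step, step_cost, predicted_label).
Step = Tuple[float, float, int]
Trajectory = Tuple[List[Step], int]  # (steps, true_label)

TOY_PATIENTS: List[Trajectory] = [
    ([(0.9, 10.0, 1), (0.4, 20.0, 0), (0.1, 30.0, 0)], 0),
    ([(0.8, 10.0, 1), (0.2, 15.0, 1)], 1),
    ([(1.0, 5.0, 0), (0.7, 25.0, 0), (0.3, 40.0, 1)], 1),
]


def run_with_threshold(steps: List[Step], eta: float) -> Tuple[float, int]:
    """Follow one trajectory, exiting once entropy is at or below eta."""
    total_cost = 0.0
    prediction = steps[0][2]
    for entropy, cost, label in steps:
        total_cost += cost      # every executed test is paid for
        prediction = label      # the latest belief determines the prediction
        if entropy <= eta:
            break               # early exit: confident enough to stop
    return total_cost, prediction


def sweep(patients: List[Trajectory], etas: List[float]):
    """Return one (eta, average_cost, accuracy) point per threshold."""
    frontier = []
    for eta in etas:
        costs, hits = [], []
        for steps, truth in patients:
            cost, pred = run_with_threshold(steps, eta)
            costs.append(cost)
            hits.append(pred == truth)
        frontier.append((eta, sum(costs) / len(costs), sum(hits) / len(hits)))
    return frontier


# 30 linearly spaced thresholds on [0, 1] (the range is an assumption).
ETAS = [i / 29 for i in range(30)]
FRONTIER = sweep(TOY_PATIENTS, ETAS)
```

Raising $\eta$ can only make a trajectory stop earlier, so the average cost in `FRONTIER` is non\-increasing in the threshold; plotting accuracy against cost for these points gives the frontier\.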
##### Dynamic Uncertainty Reduction\.
A hallmark of Bayesian Optimal Experimental Design is that the optimal measurement strategy depends on the prior\. Our audit trail \(Fig\. [14](https://arxiv.org/html/2605.26147#S6.F14)\) provides evidence of this non\-myopic behavior\. The fact that entropy drops significantly across multiple steps \($t=0$ to $t=5$\) indicates that the model is not relying on a “magic bullet” test but rather gathering a chain of evidence, where each test updates the belief state, and the subsequent test selection is conditioned on the evidence gathered previously\. The threshold crossing at $t=2$ demonstrates the model performing dynamic early\-exiting: it recognized it had attained sufficient information to be confident and halted testing to save resources\.
Figure 14: Dynamic Entropy Reduction for a single patient trajectory\. The agent sequentially reduces belief entropy until the confidence threshold $\eta$ is breached, demonstrating non\-myopic sequential planning\.
##### Conclusion on Empirical Performance\.
As demonstrated in our cost\-accuracy Pareto Frontier, AR\-NBSR exhibits strong resource\-rationality\. While the Greedy baseline provides a strong myopic upper bound, AR\-NBSR achieves competitive diagnostic accuracy with significantly lower average computation per inference\. Further, the audit trail confirms that AR\-NBSR does not merely execute a fixed sequence, but dynamically reduces belief entropy across multiple time steps, naturally terminating upon reaching the confidence threshold $\eta$\. This confirms that our framework is capable of long\-horizon planning, moving beyond the myopic limitations of standard active learning\.
## 7Discussion
This NBSR framework represents a fundamental shift from static, monolithic deep learning towards an active, sequential feature acquisition paradigm\. By unifying hierarchical routing, exact Bayesian updating, and discrete conditional execution, NBSR naturally aligns neural inference with human cognitive strategies\. Below, we discuss the architectural implications, training dynamics, and theoretical extensions of this framework\.
### 7\.1Active Knowledge Retrieval vs\. Information Bottlenecking
Standard deep neural networks frequently suffer from information bottlenecking\[[71](https://arxiv.org/html/2605.26147#bib.bib81)\]\. When data is presented entirely at the input layer, critical low\-level details can be prematurely discarded as information propagates through sequential transformations\. For example, early layers in a convolutional neural network may extract broad spatial features while irreparably discarding granular textures that might be vital for downstream, fine\-grained classification\. While residual connections partially mitigate this by passing signals across contiguous layers, they do not preserve the unadulterated global context across the entire depth of the network\.
In contrast, the NBSR framework fundamentally bypasses this degradation by treating the initial dense embedding as a Persistent Knowledge Oracle \($\mathbf{h}_x$\)\. Instead of processing the data in a single monolithic pass where information is passively lost, the decision graph actively and iteratively queries this oracle\. In this sense, NBSR shares conceptual similarities with Joint\-Embedding Predictive Architectures \(JEPA\)\[[50](https://arxiv.org/html/2605.26147#bib.bib74)\]: both systems seek to filter environmental noise and extract abstract, task\-relevant true signals\. However, in NBSR, this extraction is performed sequentially and information retrieval is done repeatedly\. Because each nodal decision\-maker selectively chooses exactly what knowledge to retrieve from the global oracle to resolve its immediate uncertainty, this process inherently functions as an attention mechanism over the semantic space\. Every local nodal expert $f(\mathbf{h}_x;\mathbf{W}_{v_{t+1}},\mathbf{b}_{v_{t+1}})$ serves as a task\-specific probe, dynamically projecting the static, high\-dimensional representation into a localized, outcome\-discriminative evidence space\.
### 7\.2Hierarchical Epistemic Capacity and Safe Inference
To accurately simulate a hierarchical taxonomy, different levels of abstract detail must be absorbed at different depths\. As empirically demonstrated in our Toy Experiment \(Section [6\.1](https://arxiv.org/html/2605.26147#S6.SS1)\), this is mathematically enforced by manipulating the epistemic capacity of the local experts\. By applying a scaled Sigmoid activation to intermediate nodes, we impose a strict “confidence budget”, forcing mid\-level experts to act as generally broad, hesitant super\-categories\. Conversely, terminal leaf experts utilize an unbounded Softplus activation, allowing them to inject massive evidence to crystallize a sharply defined decision boundary\.
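The contrast between the two activations can be made concrete with a small sketch\. This is an illustration under stated assumptions, not the paper's implementation: the `BUDGET` value and the example logits are hypothetical, and the functions operate on one scalar logit per class\.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable softplus log(1 + exp(z)); unbounded above."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def scaled_sigmoid(z: float, budget: float) -> float:
    """Sigmoid scaled by a confidence budget; the output never exceeds budget."""
    if z >= 0.0:
        return budget / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow for very negative logits
    return budget * ez / (1.0 + ez)


LOGITS = [-2.0, 0.0, 5.0, 50.0]   # hypothetical expert pre-activations
BUDGET = 2.0                      # hypothetical cap for intermediate nodes

# Intermediate experts: evidence saturates at the budget.
INTERMEDIATE = [scaled_sigmoid(z, BUDGET) for z in LOGITS]
# Leaf experts: evidence keeps growing with the logit.
LEAF = [softplus(z) for z in LOGITS]
```

For the largest logit, the intermediate evidence stays at the budget while the leaf evidence is roughly equal to the logit itself, which is the asymmetry the paragraph above describes\.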
A critical architectural imperative in this deployment is the strict prohibition against normalizing these intermediate Dirichlet evidence vectors \($\bm{\alpha}_t$\)\. In standard neural control policies, latent normalization \(e\.g\. via a continuous Softmax\) is routinely applied to bound representations\. However, in Subjective Logic, the total Dirichlet precision $\alpha_0$ serves as the fundamental mathematical anchor for epistemic uncertainty, defined as $u=K/\alpha_0$ \(cf\. [Eq\. 24](https://arxiv.org/html/2605.26147#S3.Ex2)\)\. Artificially normalizing $\bm{\alpha}_t$ forces $\alpha_0$ to a constant, permanently destroying the framework’s native ability to trigger out\-of\-distribution \(OOD\) safety abstentions\. By ensuring that evidence remains strictly raw, additive, and unbounded, NBSR successfully avoids hallucination vulnerabilities, guaranteeing safe epistemic collapse \(Corollary [1](https://arxiv.org/html/2605.26147#Thmcorollary1)\) as validated in our POMDP Control and Language Modeling experiments\.
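A short numerical sketch shows why normalization removes the OOD signal\. The evidence vectors below are hypothetical, and `normalize` is an illustrative stand\-in for any rescaling that fixes the total precision; only the formula $u=K/\alpha_0$ is taken from the text\.

```python
def vacuity(alpha):
    """Epistemic uncertainty mass u = K / alpha_0."""
    return len(alpha) / sum(alpha)


def normalize(alpha, total=1.0):
    """Rescale alpha so that it sums to a fixed total (Softmax-like effect)."""
    a0 = sum(alpha)
    return [total * a / a0 for a in alpha]


PRIOR = [1.0, 1.0, 1.0]

# Hypothetical evidence: strong for a familiar input, tiny for an OOD input.
IN_DIST = [a + e for a, e in zip(PRIOR, [40.0, 2.0, 1.0])]
OOD = [a + e for a, e in zip(PRIOR, [0.05, 0.04, 0.06])]

U_IN, U_OOD = vacuity(IN_DIST), vacuity(OOD)   # clearly separated
U_IN_NORM = vacuity(normalize(IN_DIST))        # identical after normalization
U_OOD_NORM = vacuity(normalize(OOD))
```

With raw pseudo\-counts the two inputs have very different uncertainty masses, so a threshold on $u$ can trigger abstention\. After normalization both collapse to the same value, and the distinction is lost\.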
### 7\.3Decoupling of Training and Inference
Our NBSR framework explicitly decouples the training and inference regimes to simultaneously maximize skill acquisition and computational efficiency\. In practice, we must balance two critical geometric aspects of the routing tree:
##### Depth and the “Lazy Optimization” Problem\.
During training, the dynamic early\-exiting mechanism \(controlled by the entropy threshold $\eta$\) should ideally be disabled, forcing every sample to traverse the full depth of the graph\. Neural networks are inherently “lazy” optimizers; they seek the path of least mathematical resistance to minimize the loss function\. If early exiting were permitted during the learning phase, intermediate experts would quickly learn just enough rudimentary features to push the differential entropy slightly below $\eta$, halting the forward pass prematurely\. Consequently, deeper leaf experts would experience gradient starvation, rendering the extended architecture functionally dead\. By forcing full\-depth traversals during training, we ensure that every level of the graph receives the gradients it needs to acquire specialized skills\. At inference time, the $\eta$ threshold is activated, allowing the model to resource\-rationally reap the computational savings of these rigorously acquired skills\.
##### Width and the “Group Project” Problem\.
Conversely, while utilizing the full depth is essential during training, utilizing the full width \(i\.e\. soft routing\) is catastrophic\. If the framework utilized standard continuous relaxation where an input traverses every branch and the final update is a probability\-weighted average of all terminal leaf experts, the model would fall into representation collapse\[[70](https://arxiv.org/html/2605.26147#bib.bib22)\]\. This dynamic is analogous to a “group project” where the final grade is based on a blended average: because the network relies on the ensemble to minimize the loss, individual experts fail to specialize, instead learning generic, overlapping, “blurry” features\. NBSR resolves this via the Gumbel\-Softmax Straight\-Through Estimator \(STE\)\. By enforcing hard discrete routing during the forward pass, the selected expert is forced to assume 100% responsibility for the prediction, which strongly encourages specialization\. During the backward pass, the continuous relaxation allows the routing policy to evaluate counterfactual probabilities and learn optimal pruning strategies without ever incurring the computational cost of the unselected experts\.
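The forward and backward halves of the straight\-through estimator can be sketched without an autodiff library\. This is a hedged illustration: the function names, the logits, the temperature `tau` and the upstream gradient are hypothetical, and the backward pass is written out by hand as the vector\-Jacobian product of the tempered softmax\.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_st(logits, tau, rng):
    """Return (y_hard, y_soft) for a single routing decision."""
    # Gumbel(0, 1) noise by inverse transform: g = -log(-log(u)).
    noise = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12)))
             for _ in logits]
    y_soft = softmax([(l + g) / tau for l, g in zip(logits, noise)])
    winner = max(range(len(y_soft)), key=y_soft.__getitem__)
    # Forward pass uses the hard one-hot choice: only one expert runs.
    y_hard = [1.0 if i == winner else 0.0 for i in range(len(y_soft))]
    return y_hard, y_soft


def surrogate_logit_grad(upstream, y_soft, tau):
    """Straight-through backward pass.

    The gradient with respect to the logits is computed as if the output
    had been y_soft, i.e. the softmax vector-Jacobian product divided by tau.
    """
    mean_g = sum(g * y for g, y in zip(upstream, y_soft))
    return [y * (g - mean_g) / tau for g, y in zip(upstream, y_soft)]


rng = random.Random(0)
Y_HARD, Y_SOFT = gumbel_softmax_st([2.0, 0.5, -1.0], tau=0.5, rng=rng)
# Hypothetical gradient of the loss with respect to the routing vector.
GRAD = surrogate_logit_grad([0.3, -0.1, 0.2], Y_SOFT, tau=0.5)
```

In an autodiff framework the same effect is usually obtained by returning the hard one\-hot vector in the forward pass while letting gradients flow through the soft relaxation; the manual version above only makes the two paths explicit\.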
### 7\.4NBSR as a Markov Decision Process \(MDP\)
While NBSR is primarily evaluated on discriminative classification tasks, we can formally characterize its sequential inference process as a discrete\-time Markov Decision Process \(MDP\)\[[63](https://arxiv.org/html/2605.26147#bib.bib45)\] defined by the tuple $(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R})$\. Here, the state space $\mathcal{S}$ is defined by the augmented belief state $\bm{\alpha}_t$ and the persistent oracle context $\mathbf{h}_x$; the action space $\mathcal{A}$ corresponds to the discrete selection of expert nodes via the router $\pi_\theta(a_t\mid\mathcal{S}_t)$; the transition dynamics $\mathcal{P}(\mathcal{S}_{t+1}\mid\mathcal{S}_t,a_t)$ are governed by the exact Bayesian conjugate update $\bm{\alpha}_{t+1}=\bm{\alpha}_t+\mathbf{e}_t$; and the reward function $\mathcal{R}$ is implicitly defined as the negative differential entropy $-\mathcal{H}(\bm{\alpha}_t)$ minus any associated measurement costs $c(v_t)$\.
In this MDP, the optimal policy $\pi^{*}$ seeks to maximize the cumulative information gain while minimizing expenditure, effectively treating diagnostic triage and sequential reasoning as an optimal control problem\. This MDP formulation reveals that our NBSR architecture is fundamentally a Policy Network learning to navigate an epistemic state\-space, providing the algorithmic foundation for the Autoregressive State Machine \(AR\-NBSR\) demonstrated in our Active Clinical Triage experiment \(Section [6\.6](https://arxiv.org/html/2605.26147#S6.SS6)\)\.
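One transition of this MDP can be written down directly\. The sketch below makes several assumptions that are not taken from the paper: the reward is evaluated on the updated belief, the measurement cost is treated as a penalty subtracted from the reward with unit weight, and `digamma` is a hand\-rolled approximation \(recurrence plus asymptotic series\) used only to keep the example dependency\-free\.

```python
import math


def digamma(x: float) -> float:
    """Approximate digamma for x > 0: upward recurrence, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def dirichlet_entropy(alpha) -> float:
    """Differential entropy of Dir(alpha)."""
    a0, k = sum(alpha), len(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(a0)
    return (log_beta
            + (a0 - k) * digamma(a0)
            - sum((a - 1.0) * digamma(a) for a in alpha))


def mdp_step(alpha, evidence, cost):
    """Apply the conjugate update and return (next_alpha, reward)."""
    if any(e <= 0.0 for e in evidence):
        raise ValueError("evidence must be strictly positive")
    next_alpha = [a + e for a, e in zip(alpha, evidence)]   # alpha_{t+1}
    # Assumed reward: negative entropy of the new belief, minus the cost.
    reward = -dirichlet_entropy(next_alpha) - cost
    return next_alpha, reward


# Hypothetical single step: uniform prior, one informative but costly test.
ALPHA_0 = [1.0, 1.0]
ALPHA_1, REWARD_1 = mdp_step(ALPHA_0, evidence=[4.0, 1.0], cost=10.0)
```

A policy that maximizes the sum of such rewards is pushed toward tests that sharpen the belief quickly relative to their price, which is the trade\-off described above\.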
### 7\.5Modular Skill Acquisition and Unbounded Topologies
As the field of artificial intelligence shifts towards modular, agentic systems, architectures must evolve to support composable skill ensembles\. In NBSR, each expert network acts as an isolated, highly specialized cognitive skill\. The routing tree ensembles these skills both horizontally and sequentially, selectively activating only the specific sub\-regions required to process the current input\. This mechanism offers a computationally lightweight, Bayesian alternative to standard dense attention mechanisms\.
Looking forward, this framework naturally accommodates several rigorous theoretical extensions:
- •Neuro\-Symbolic Integration via Product of Experts: The sequential evidence accumulation natively supports the injection of externally derived rules or heuristics\. Future work could calibrate the node Dirichlet distribution via a Product of Experts \(PoE\)\[[30](https://arxiv.org/html/2605.26147#bib.bib75),[35](https://arxiv.org/html/2605.26147#bib.bib18)\], a framework that models a probability distribution by multiplying the distributions of several simpler models \(experts\) and renormalizing, using a distinct correction term derived from the contextual side information \($\bm{\varepsilon}_t$\)\.
- •Unbounded Width via Dirichlet Process: Instead of routing over a fixed $K$\-dimensional categorical distribution, routers could be reformulated using a Dirichlet Process via a differentiable stick\-breaking mechanism\[[38](https://arxiv.org/html/2605.26147#bib.bib76)\]\. Stick\-breaking processes are widely used procedures for constructing random discrete distributions in statistics and machine learning\[[22](https://arxiv.org/html/2605.26147#bib.bib19)\]; owing to their intuitive construction and computational tractability, they have become popular in modern Bayesian nonparametric inference, serving as the foundational mechanism for models such as the Dirichlet and Pitman\-Yor processes\. In this paradigm, a router leaves a residual probability mass to instantiate a newly discovered expert, allowing the horizontal branching factor to grow without a fixed bound if existing experts fail to sufficiently minimize the negative log\-likelihood\.
- •Unbounded Depth via Information\-Gain Splitting: To dynamically expand vertical depth, we can apply node\-splitting rules from traditional decision trees \(e\.g\. CART\[[7](https://arxiv.org/html/2605.26147#bib.bib77)\]\) to our neural modules\. If a terminal expert’s average differential entropy $\mathcal{H}(\bm{\alpha}_T)$ plateaus above a critical threshold \(which indicates it lacks the capacity to resolve the remaining uncertainty of its assigned data sub\-population\), the network can dynamically freeze that expert, spawn a new router in its place, and branch into newly initialized child experts to further partition the semantic space\.
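The stick\-breaking idea in the second item can be illustrated with a minimal sketch\. The break fractions would come from a router \(for example through a sigmoid\) in a differentiable version; here they are hypothetical constants, and the growth rule in `should_spawn_expert` with its threshold is an assumption made only for illustration\.

```python
def stick_breaking(fractions):
    """Turn break fractions v_k in (0, 1) into weights plus a residual mass.

    weight_k = v_k * prod_{j<k} (1 - v_j); the residual is what is left of
    the stick after all existing experts have taken their share.
    """
    weights, remaining = [], 1.0
    for v in fractions:
        if not 0.0 < v < 1.0:
            raise ValueError("break fractions must lie strictly in (0, 1)")
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining


def should_spawn_expert(fractions, threshold):
    """Hypothetical rule: add an expert when the residual mass is large."""
    _, residual = stick_breaking(fractions)
    return residual > threshold


# Three existing experts, each taking half of the remaining stick.
WEIGHTS, RESIDUAL = stick_breaking([0.5, 0.5, 0.5])
```

The weights and the residual always sum to one, so the residual can be read as the probability that none of the current experts is appropriate\.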
## 8Conclusion
We introduced Neural Bayesian Sequential Routing \(NBSR\), a framework for turning neural prediction into a sequential, uncertainty\-aware process of evidence acquisition\. The central idea is simple but powerful: instead of producing a one\-shot softmax decision from a monolithic network, NBSR routes an input through a hierarchy of neural experts, lets each visited expert query a persistent global representation $\mathbf{h}_x$, and accumulates the resulting positive evidence vectors as Dirichlet pseudo\-counts\. In this way, prediction is no longer merely a class\-score computation; it becomes a forward\-causal belief trajectory whose intermediate states, uncertainty levels, and expert contributions can be inspected\.
The main methodological contribution is the unification of three ingredients that are usually treated separately: conditional neural execution, evidential uncertainty, and sequential decision\-making\. The Dirichlet\-Categorical update $\bm{\alpha}_{t+1}=\bm{\alpha}_t+\mathbf{e}_t$ gives the model an explicit Bayesian belief state over the final outcome space; the Gumbel\-Softmax Straight\-Through estimator makes discrete, path\-dependent routing trainable end\-to\-end; and the entropy/precision of the evolving Dirichlet state provides a native mechanism for early exiting, abstention, and cost\-aware acquisition\. This yields a model that is simultaneously predictive, modular, interpretable, and resource\-sensitive\.
The theoretical analysis clarified what this construction guarantees and what it does not\. Under strictly positive evidence extraction, NBSR guarantees monotone growth of total Dirichlet precision and a corresponding upper bound on marginal variance, formalizing the intended “hypothesis sharpening” effect\. The consistency result shows that, under idealized capacity and optimization assumptions, the expected terminal Dirichlet prediction can recover the Bayes\-optimal conditional distribution\. The topological analysis further explains how graph depth and width trade bias against variance, while entropy\-based early exiting converts this global architectural trade\-off into a sample\-dependent one\.
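One standard way to see the precision and variance statements is through the marginal moments of the Dirichlet distribution\. The derivation below is a sketch using textbook identities; the paper's own theorem may state the bound in a different form\.

```latex
% Marginal moments of p ~ Dir(alpha), with alpha_0 = sum_k alpha_k
\mathbb{E}[p_k] = \frac{\alpha_k}{\alpha_0} =: m_k,
\qquad
\operatorname{Var}[p_k] = \frac{m_k\,(1 - m_k)}{\alpha_0 + 1}
\;\le\; \frac{1}{4\,(\alpha_0 + 1)} .

% Conjugate update with strictly positive evidence e_{t,k} > 0
\alpha_{0,t+1} = \alpha_{0,t} + \sum_{k=1}^{K} e_{t,k} \;>\; \alpha_{0,t}
\quad\Longrightarrow\quad
\frac{1}{4\,(\alpha_{0,t+1} + 1)} \;<\; \frac{1}{4\,(\alpha_{0,t} + 1)} .
```

The inequality uses $m_k(1-m_k)\le 1/4$\. Note that it is the bound that shrinks at every step; the variance of an individual class can still rise temporarily if new evidence moves its mean toward $1/2$\.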
Empirically, the experiments show that NBSR is most valuable when accuracy, uncertainty, computational selectivity, and interpretability must be considered together\. On CIFAR\-10, NBSR achieved strong classification performance, reaching 96\.74% accuracy with an ECE of 0\.015, while retaining a full routing audit trail and enabling nearly 90% of test samples to exit at Depth 1 without degrading accuracy\. The sequential precision and entropy plots also directly validated the predicted belief\-sharpening behavior\. In structured medical diagnosis, NBSR matched the flat MLP in diagnostic accuracy while providing patient\-specific routing traces and path\-dependent feature attributions, although tree\-based methods remained strong raw\-performance baselines on this tabular dataset\. In language modeling, NBSR preserved near\-baseline perplexity on the controlled grammar task while exposing token\-level syntactic/semantic routing decisions, highlighting interpretability and uncertainty structure rather than raw throughput dominance\.
The extensions further show that NBSR is not limited to static classification\. With recurrent memory, NBSR\-Mem matched the black\-box CNN\-GRU controller on the partially observable navigation task while retaining interpretable policy traces and OOD halting behavior\. In the active clinical triage setting, the autoregressive AR\-NBSR formulation demonstrated that the same belief\-state machinery can act as a cost\-aware experimental\-design policy: it learned to reduce diagnostic entropy over multiple steps and reached the same predictive asymptote at a substantially lower average cost than the greedy baseline\. These results support the view of NBSR as a general architecture for sequential, resource\-rational neural inference\.
Overall, NBSR suggests a different design principle for modular AI systems: experts should not merely be softly averaged, but selectively queried; uncertainty should not be an auxiliary diagnostic, but part of the state that controls computation; and interpretability should not be reconstructed post hoc, but generated by the forward pass itself\. Future work will scale the framework to larger vocabularies and real\-world sequential decision tasks, improve calibration of the Dirichlet evidence in high\-dimensional settings, and investigate learnable topologies, dynamic experts, neuro\-symbolic evidence injection, and nonparametric extensions such as Dirichlet\-process routing\. These directions may further develop NBSR into a practical foundation for transparent, adaptive, and resource\-rational agentic AI\.
## References
- \[1\]M\. Abadi, A\. Agarwal, P\. Barham, E\. Brevdo, Z\. Chen, C\. Citro, G\. S\. Corrado, A\. Davis, J\. Dean, M\. Devin, S\. Ghemawat, I\. Goodfellow, A\. Harp, G\. Irving, M\. Isard, Y\. Jia, R\. Jozefowicz, L\. Kaiser, M\. Kudlur, J\. Levenberg, D\. Mané, R\. Monga, S\. Moore, D\. Murray, C\. Olah, M\. Schuster, J\. Shlens, B\. Steiner, I\. Sutskever, K\. Talwar, P\. Tucker, V\. Vanhoucke, V\. Vasudevan, F\. Viégas, O\. Vinyals, P\. Warden, M\. Wattenberg, M\. Wicke, Y\. Yu, and X\. Zheng\(2015\)TensorFlow: large\-scale machine learning on heterogeneous systems\.Note:Software available from tensorflow\.orgExternal Links:[Link](https://www.tensorflow.org/)Cited by:[§4\.5](https://arxiv.org/html/2605.26147#S4.SS5.p3.4)\.
- \[2\]Y\. Bengio, P\. Simard, and P\. Frasconi\(1994\-03\)Learning long\-term dependencies with gradient descent is difficult\.Trans\. Neur\. Netw\.5\(2\),pp\. 157–166\.External Links:ISSN 1045\-9227,[Link](https://doi.org/10.1109/72.279181),[Document](https://dx.doi.org/10.1109/72.279181)Cited by:[§6\.5\.1](https://arxiv.org/html/2605.26147#S6.SS5.SSS1.Px3.p1.1)\.
- \[3\]Y\. Bengio, N\. Léonard, and A\. Courville\(2013\)Estimating or propagating gradients through stochastic neurons for conditional computation\.External Links:1308\.3432,[Link](https://arxiv.org/abs/1308.3432)Cited by:[§C\.3](https://arxiv.org/html/2605.26147#A3.SS3.p2.2),[§1](https://arxiv.org/html/2605.26147#S1.p5.1),[§3](https://arxiv.org/html/2605.26147#S3.SS0.SSS0.Px3.p2.3)\.
- \[4\]C\. M\. Bishop\(2006\)Pattern recognition and machine learning\.1 edition,Information Science and Statistics,Springer,New York, NY\.External Links:ISBN 978\-0\-387\-31073\-2Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p4.1)\.
- \[5\]C\. Blundell, J\. Cornebise, K\. Kavukcuoglu, and D\. Wierstra\(2015\)Weight uncertainty in neural networks\.InProceedings of the 32nd International Conference on Machine Learning \- Volume 37,ICML’15,pp\. 1613–1622\.Cited by:[§2](https://arxiv.org/html/2605.26147#S2.SS0.SSS0.Px1.p1.1)\.
- \[6\]E\. B\. Bonawitz, D\. Ferranti, R\. Saxe, A\. Gopnik, A\. N\. Meltzoff, J\. Woodward, and L\. E\. Schulz\(2010\)Just do it? investigating the gap between prediction and action in toddlers’ causal inferences\.Cognition115\(1\),pp\. 104–117\.External Links:ISSN 0010\-0277,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.cognition.2009.12.001),[Link](https://www.sciencedirect.com/science/article/pii/S0010027709002947)Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p1.1)\.
- \[7\]L\. Breiman, J\. H\. Friedman, R\. A\. Olshen, and C\. J\. Stone\(1984\)Classification and regression trees\.1 edition,Chapman and Hall/CRC\.External Links:[Document](https://dx.doi.org/10.1201/9781315139470)Cited by:[Appendix E](https://arxiv.org/html/2605.26147#A5.SS0.SSS0.Px3.p1.1),[Appendix E](https://arxiv.org/html/2605.26147#A5.p1.1),[§2](https://arxiv.org/html/2605.26147#S2.SS0.SSS0.Px3.p1.1),[§4\.3](https://arxiv.org/html/2605.26147#S4.SS3.p1.8),[3rd item](https://arxiv.org/html/2605.26147#S7.I1.i3.p1.1)\.
- \[8\]T\. Chen and C\. Guestrin\(2016\)XGBoost: a scalable tree boosting system\.InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,KDD ’16,New York, NY, USA,pp\. 785–794\.External Links:ISBN 9781450342322,[Link](https://doi.org/10.1145/2939672.2939785),[Document](https://dx.doi.org/10.1145/2939672.2939785)Cited by:[§6\.3\.1](https://arxiv.org/html/2605.26147#S6.SS3.SSS1.p4.3)\.
- \[9\]K\. M\. Collins, I\. Sucholutsky, U\. Bhatt,et al\.\(2024\)Building machines that learn and think with people\.Nature Human Behaviour8,pp\. 1851–1863\.External Links:[Document](https://dx.doi.org/10.1038/s41562-024-01991-9)Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p2.1)\.
- \[10\]G\. Cybenko\(1989\)Approximation by superpositions of a sigmoidal function\.Mathematics of Control, Signals, and Systems2,pp\. 303–314\.External Links:[Document](https://dx.doi.org/10.1007/BF02551274)Cited by:[§5\.2](https://arxiv.org/html/2605.26147#S5.SS2.1.p1.1),[§5\.2](https://arxiv.org/html/2605.26147#S5.SS2.p1.1)\.
- \[11\]J\. Deng, W\. Dong, R\. Socher, L\. Li, K\. Li, and L\. Fei\-Fei\(2009\-06\)ImageNet: A large\-scale hierarchical image database\.In2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops \(CVPR Workshops\),Vol\.,Los Alamitos, CA, USA,pp\. 248–255\.External Links:ISSN 1063\-6919,[Document](https://dx.doi.org/10.1109/CVPR.2009.5206848),[Link](https://doi.ieeecomputersociety.org/10.1109/CVPR.2009.5206848)Cited by:[footnote 21](https://arxiv.org/html/2605.26147#footnote21)\.
- \[12\]S\. Desai and G\. Durrett\(2020\)Calibration of pre\-trained transformers\.External Links:2003\.07892,[Link](https://arxiv.org/abs/2003.07892)Cited by:[§6\.4\.1](https://arxiv.org/html/2605.26147#S6.SS4.SSS1.p1.3),[§6\.4\.2](https://arxiv.org/html/2605.26147#S6.SS4.SSS2.Px3.p1.1)\.
- \[13\]A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, J\. Uszkoreit, and N\. Houlsby\(2021\)An image is worth 16x16 words: transformers for image recognition at scale\.External Links:2010\.11929,[Link](https://arxiv.org/abs/2010.11929)Cited by:[footnote 18](https://arxiv.org/html/2605.26147#footnote18)\.
- \[14\]M\. Falk and F\. Marohn\(1993\)Von mises conditions revisited\.The Annals of Probability21\(3\),pp\. 1310–1328\.Cited by:[§B\.3](https://arxiv.org/html/2605.26147#A2.SS3.p1.1)\.
- \[15\]A\. Fan, M\. Lewis, and Y\. Dauphin\(2018\-07\)Hierarchical neural story generation\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),I\. Gurevych and Y\. Miyao \(Eds\.\),Melbourne, Australia,pp\. 889–898\.External Links:[Link](https://aclanthology.org/P18-1082/),[Document](https://dx.doi.org/10.18653/v1/P18-1082)Cited by:[§6\.4\.1](https://arxiv.org/html/2605.26147#S6.SS4.SSS1.p1.3)\.
- \[16\]W\. Fedus, B\. Zoph, and N\. Shazeer\(2022\-01\)Switch transformers: scaling to trillion parameter models with simple and efficient sparsity\.J\. Mach\. Learn\. Res\.23\(1\)\.External Links:ISSN 1532\-4435Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p3.1),[footnote 24](https://arxiv.org/html/2605.26147#footnote24)\.
- \[17\]R\. A\. Fisher and L\. H\. C\. Tippett\(1928\)Limiting forms of the frequency distribution of the largest and smallest member of a sample\.Mathematical Proceedings of the Cambridge Philosophical Society24\(2\),pp\. 180–190\.External Links:[Document](https://dx.doi.org/10.1017/S0305004100015681)Cited by:[§B\.3](https://arxiv.org/html/2605.26147#A2.SS3.p1.1)\.
- \[18\]A\. Foster, M\. Jankowiak, E\. Bingham, P\. Horsfall, Y\. W\. Teh, T\. Rainforth, and N\. Goodman\(2019\)Variational bayesian optimal experimental design\.InAdvances in Neural Information Processing Systems,H\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d'Alché\-Buc, E\. Fox, and R\. Garnett \(Eds\.\),Vol\.32,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/d55cbf210f175f4a37916eafe6c04f0d-Paper.pdf)Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p5.1),[§6\.6\.1](https://arxiv.org/html/2605.26147#S6.SS6.SSS1.Px2.p2.4),[§6\.6](https://arxiv.org/html/2605.26147#S6.SS6.p1.7),[§6\.6](https://arxiv.org/html/2605.26147#S6.SS6.p1.8),[§6\.6](https://arxiv.org/html/2605.26147#S6.SS6.p2.7),[footnote 34](https://arxiv.org/html/2605.26147#footnote34)\.
- \[19\]M\. Fréchet\(1927\)Sur la loi de probabilité de l’écart maximum\.Annales de la Société Polonaise de Mathématique6\(1\),pp\. 93–116\.Cited by:[§B\.3](https://arxiv.org/html/2605.26147#A2.SS3.p1.1)\.
- \[20\]Y\. Gal and Z\. Ghahramani\(2016\)Dropout as a bayesian approximation: representing model uncertainty in deep learning\.InProceedings of the 33rd International Conference on International Conference on Machine Learning \- Volume 48,ICML’16,pp\. 1050–1059\.Cited by:[§2](https://arxiv.org/html/2605.26147#S2.SS0.SSS0.Px1.p1.1)\.
- \[21\]A\. Gelman, J\. B\. Carlin, H\. S\. Stern, D\. B\. Dunson, A\. Vehtari, and D\. B\. Rubin\(2013\)Bayesian data analysis\.3 edition,Chapman and Hall/CRC,New York\.External Links:[Document](https://dx.doi.org/10.1201/b16018),ISBN 9780429113079Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p4.1),[§2](https://arxiv.org/html/2605.26147#S2.SS0.SSS0.Px1.p2.1)\.
- \[22\]M\. F\. Gil\-Leyva, A\. Lijoi, R\. H\. Mena, and I\. Prünster\(2026\)Markov stick\-breaking processes\.External Links:2601\.16561,[Link](https://arxiv.org/abs/2601.16561)Cited by:[footnote 36](https://arxiv.org/html/2605.26147#footnote36)\.
- \[23\]B\. V\. Gnedenko\(1943\)Sur la distribution limite du terme maximum d’une série aléatoire\.Annals of Mathematics44\(3\),pp\. 423–453\.External Links:[Document](https://dx.doi.org/10.2307/1968974)Cited by:[§B\.3](https://arxiv.org/html/2605.26147#A2.SS3.p1.1)\.
- \[24\]Google\(2026\)Gemini 3 developer guide\.Note:Google AI for Developers\.External Links:[Link](https://ai.google.dev/gemini-api/docs/gemini-3)Cited by:[footnote 19](https://arxiv.org/html/2605.26147#footnote19)\.
- \[25\]A\. Graves\(2017\)Adaptive computation time for recurrent neural networks\.External Links:1603\.08983,[Link](https://arxiv.org/abs/1603.08983)Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p2.1)\.
- \[26\]C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger\(2017\)On calibration of modern neural networks\.InProceedings of the 34th International Conference on Machine Learning \- Volume 70,ICML’17,pp\. 1321–1330\.Cited by:[§6\.4\.1](https://arxiv.org/html/2605.26147#S6.SS4.SSS1.p1.3),[§6\.4\.2](https://arxiv.org/html/2605.26147#S6.SS4.SSS2.Px3.p1.1)\.
- \[27\]Y\. Han, G\. Huang, S\. Song, L\. Yang, H\. Wang, and Y\. Wang\(2022\-11\)Dynamic Neural Networks: A Survey\.IEEE Transactions on Pattern Analysis & Machine Intelligence44\(11\),pp\. 7436–7456\.External Links:ISSN 1939\-3539,[Document](https://dx.doi.org/10.1109/TPAMI.2021.3117837),[Link](https://doi.ieeecomputersociety.org/10.1109/TPAMI.2021.3117837)Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p2.1)\.
- \[28\]K\. He, X\. Zhang, S\. Ren, and J\. Sun\(2016\)Deep residual learning for image recognition\.In2016 IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\),Vol\.,pp\. 770–778\.External Links:[Document](https://dx.doi.org/10.1109/CVPR.2016.90)Cited by:[§F\.1](https://arxiv.org/html/2605.26147#A6.SS1.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.26147#S1.p2.1),[§6\.2\.1](https://arxiv.org/html/2605.26147#S6.SS2.SSS1.Px2.p1.6),[footnote 18](https://arxiv.org/html/2605.26147#footnote18)\.
- \[29\]D\. Hendrycks and K\. Gimpel\(2017\)A baseline for detecting misclassified and out\-of\-distribution examples in neural networks\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Hkg4TI9xl)Cited by:[item 4](https://arxiv.org/html/2605.26147#S1.I1.i4.p1.1),[§6\.4\.2](https://arxiv.org/html/2605.26147#S6.SS4.SSS2.Px3.p1.1)\.
- \[30\]G\. E\. Hinton\(2002\-08\)Training products of experts by minimizing contrastive divergence\.Neural Comput\.14\(8\),pp\. 1771–1800\.External Links:ISSN 0899\-7667,[Link](https://doi.org/10.1162/089976602760128018),[Document](https://dx.doi.org/10.1162/089976602760128018)Cited by:[1st item](https://arxiv.org/html/2605.26147#S7.I1.i1.p1.1)\.
- \[31\]S\. Hochreiter and J\. Schmidhuber\(1997\-11\)Long short\-term memory\.Neural Comput\.9\(8\),pp\. 1735–1780\.External Links:ISSN 0899\-7667,[Link](https://doi.org/10.1162/neco.1997.9.8.1735),[Document](https://dx.doi.org/10.1162/neco.1997.9.8.1735)Cited by:[§6\.5\.1](https://arxiv.org/html/2605.26147#S6.SS5.SSS1.Px3.p1.1)\.
- \[32\]A\. Holtzman, J\. Buys, L\. Du, M\. Forbes, and Y\. Choi\(2020\)The curious case of neural text degeneration\.External Links:1904\.09751,[Link](https://arxiv.org/abs/1904.09751)Cited by:[§6\.4\.1](https://arxiv.org/html/2605.26147#S6.SS4.SSS1.p1.3)\.
- \[33\]K\. Hornik\(1991\)Approximation capabilities of multilayer feedforward networks\.Neural Networks4\(2\),pp\. 251–257\.External Links:ISSN 0893\-6080,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/0893-6080%2891%2990009-T),[Link](https://www.sciencedirect.com/science/article/pii/089360809190009T)Cited by:[§5\.2](https://arxiv.org/html/2605.26147#S5.SS2.1.p1.1),[§5\.2](https://arxiv.org/html/2605.26147#S5.SS2.p1.1)\.
- \[34\]Y\. Huang\(2025\)Sampling via gaussian mixture approximations\.External Links:2509\.25232,[Link](https://arxiv.org/abs/2509.25232)Cited by:[footnote 33](https://arxiv.org/html/2605.26147#footnote33)\.
- \[35\]Y\. Huang\(2026\)VJEPA: variational joint embedding predictive architectures as probabilistic world models\.External Links:2601\.14354,[Link](https://arxiv.org/abs/2601.14354)Cited by:[1st item](https://arxiv.org/html/2605.26147#S7.I1.i1.p1.1)\.
- \[36\]Y\. Huang\(May 2026\)On the information bottleneck of VJEPA\.OpenReview\.External Links:[Link](https://openreview.net/forum?id=S7hzZHbMwY)Cited by:[§4\.9](https://arxiv.org/html/2605.26147#S4.SS9.SSS0.Px2.p2.1)\.
- \[37\]A\. Hussein, M\. M\. Gaber, E\. Elyan, and C\. Jayne\(2017\-04\)Imitation learning: a survey of learning methods\.ACM Comput\. Surv\.50\(2\)\.External Links:ISSN 0360\-0300,[Link](https://doi.org/10.1145/3054912),[Document](https://dx.doi.org/10.1145/3054912)Cited by:[§6\.5\.1](https://arxiv.org/html/2605.26147#S6.SS5.SSS1.Px1.p1.7)\.
- \[38\]H\. Ishwaran and L\. F\. James\(2001\)Gibbs sampling methods for stick\-breaking priors\.Journal of the American Statistical Association96\(453\),pp\. 161–173\.External Links:[Document](https://dx.doi.org/10.1198/016214501750332758),[Link](https://doi.org/10.1198/016214501750332758)Cited by:[2nd item](https://arxiv.org/html/2605.26147#S7.I1.i2.p1.1)\.
- \[39\]R\. A\. Jacobs, M\. I\. Jordan, S\. J\. Nowlan, and G\. E\. Hinton\(1991\)Adaptive mixtures of local experts\.Neural Computation3\(1\),pp\. 79–87\.External Links:[Document](https://dx.doi.org/10.1162/neco.1991.3.1.79)Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p3.1)\.
- \[40\]E\. Jang, S\. Gu, and B\. Poole\(2017\)Categorical reparameterization with gumbel\-softmax\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=rkE3y85ee)Cited by:[§B\.3](https://arxiv.org/html/2605.26147#A2.SS3.p2.4),[§C\.3](https://arxiv.org/html/2605.26147#A3.SS3.p2.2),[Appendix C](https://arxiv.org/html/2605.26147#A3.p1.1),[§1](https://arxiv.org/html/2605.26147#S1.p5.1),[§3](https://arxiv.org/html/2605.26147#S3.SS0.SSS0.Px3.p1.3),[§4\.5](https://arxiv.org/html/2605.26147#S4.SS5.p4.4)\.
- \[41\]Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. J\. Bang, A\. Madotto, and P\. Fung\(2023\-03\)Survey of hallucination in natural language generation\.ACM Comput\. Surv\.55\(12\)\.External Links:ISSN 0360\-0300,[Link](https://doi.org/10.1145/3571730),[Document](https://dx.doi.org/10.1145/3571730)Cited by:[§6\.4\.1](https://arxiv.org/html/2605.26147#S6.SS4.SSS1.p1.3)\.
- \[42\]M\.I\. Jordan and R\.A\. Jacobs\(1993\)Hierarchical mixtures of experts and the em algorithm\.InProceedings of 1993 International Conference on Neural Networks \(IJCNN\-93\-Nagoya, Japan\),Vol\.2,pp\. 1339–1344 vol\.2\.External Links:[Document](https://dx.doi.org/10.1109/IJCNN.1993.716791)Cited by:[Appendix E](https://arxiv.org/html/2605.26147#A5.SS0.SSS0.Px3.p1.1),[Appendix E](https://arxiv.org/html/2605.26147#A5.SS0.SSS0.Px4.p1.1),[Appendix E](https://arxiv.org/html/2605.26147#A5.p1.1),[§2](https://arxiv.org/html/2605.26147#S2.SS0.SSS0.Px3.p1.1)\.
- \[43\]A\. Jøsang\(2016\)Subjective logic: a formalism for reasoning under uncertainty\.1 edition,Artificial Intelligence: Foundations, Theory, and Algorithms,Springer Cham\.External Links:ISBN 978\-3\-319\-42335\-7,[Document](https://dx.doi.org/10.1007/978-3-319-42337-1)Cited by:[§A\.5](https://arxiv.org/html/2605.26147#A1.SS5.p1.1),[§3](https://arxiv.org/html/2605.26147#S3.SS0.SSS0.Px1.p3.6)\.
- \[44\]D\. Jurafsky and J\. H\. Martin\(2000\)Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition\.1st edition,Prentice Hall\.Cited by:[§6\.4\.1](https://arxiv.org/html/2605.26147#S6.SS4.SSS1.p3.1)\.
- \[45\]L\. P\. Kaelbling, M\. L\. Littman, and A\. R\. Cassandra\(1998\)Planning and acting in partially observable stochastic domains\.Artificial Intelligence101\(1\),pp\. 99–134\.External Links:ISSN 0004\-3702,[Document](https://doi.org/10.1016/S0004-3702%2898%2900023-X),[Link](https://www.sciencedirect.com/science/article/pii/S000437029800023X)Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p5.1)\.
- \[46\]D\. Kahneman\(2011\)Thinking, fast and slow\.Penguin Books,London\.Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p1.1)\.
- \[47\]D\. P\. Kingma and J\. Ba\(2017\)Adam: a method for stochastic optimization\.External Links:1412\.6980,[Link](https://arxiv.org/abs/1412.6980)Cited by:[§J\.4](https://arxiv.org/html/2605.26147#A10.SS4.p1.2),[§F\.1](https://arxiv.org/html/2605.26147#A6.SS1.SSS0.Px3.p1.3),[§G\.4](https://arxiv.org/html/2605.26147#A7.SS4.p1.1),[§H\.4](https://arxiv.org/html/2605.26147#A8.SS4.p1.1),[§I\.4](https://arxiv.org/html/2605.26147#A9.SS4.p1.1),[§6\.2\.1](https://arxiv.org/html/2605.26147#S6.SS2.SSS1.Px2.p1.6)\.
- \[48\]B\. Lambert\(2018\-04\)A student’s guide to bayesian statistics\.SAGE Publications Ltd\.Cited by:[§2](https://arxiv.org/html/2605.26147#S2.SS0.SSS0.Px1.p2.1)\.
- \[49\]Y\. LeCun, Y\. Bengio, and G\. Hinton\(2015\)Deep learning\.Nature521,pp\. 436–444\.External Links:[Document](https://dx.doi.org/10.1038/nature14539)Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p2.1)\.
- \[50\]Y\. LeCun\(2022\)A path towards autonomous machine intelligence version 0\.9\.2, 2022\-06\-27\.Open Review62\(1\),pp\. 1–62\.Cited by:[§7\.1](https://arxiv.org/html/2605.26147#S7.SS1.p2.2)\.
- \[51\]Z\. C\. Lipton\(2018\-09\)The mythos of model interpretability\.Commun\. ACM61\(10\),pp\. 36–43\.External Links:ISSN 0001\-0782,[Link](https://doi.org/10.1145/3233231),[Document](https://dx.doi.org/10.1145/3233231)Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p2.1)\.
- \[52\]S\. M\. Lundberg and S\. Lee\(2017\)A unified approach to interpreting model predictions\.InProceedings of the 31st International Conference on Neural Information Processing Systems,NIPS’17,Red Hook, NY, USA,pp\. 4768–4777\.External Links:ISBN 9781510860964Cited by:[§6\.3\.2](https://arxiv.org/html/2605.26147#S6.SS3.SSS2.Px1.p1.1),[§6\.3\.2](https://arxiv.org/html/2605.26147#S6.SS3.SSS2.Px2.p1.1)\.
- \[53\]C\. J\. Maddison, A\. Mnih, and Y\. W\. Teh\(2017\)The concrete distribution: a continuous relaxation of discrete random variables\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=S1jE5L5gl)Cited by:[§B\.3](https://arxiv.org/html/2605.26147#A2.SS3.p2.4),[Appendix C](https://arxiv.org/html/2605.26147#A3.p1.1),[§1](https://arxiv.org/html/2605.26147#S1.p5.1),[§3](https://arxiv.org/html/2605.26147#S3.SS0.SSS0.Px3.p1.3),[§4\.5](https://arxiv.org/html/2605.26147#S4.SS5.p4.4)\.
- \[54\]C\. J\. Maddison, D\. Tarlow, and T\. Minka\(2014\)A\* sampling\.InProceedings of the 28th International Conference on Neural Information Processing Systems \- Volume 2,NIPS’14,Cambridge, MA, USA,pp\. 3086–3094\.Cited by:[§B\.3](https://arxiv.org/html/2605.26147#A2.SS3.p2.4)\.
- \[55\]M\. P\. Marcus, B\. Santorini, and M\. A\. Marcinkiewicz\(1993\)Building a large annotated corpus of English: the Penn Treebank\.Computational Linguistics19\(2\),pp\. 313–330\.External Links:[Link](https://aclanthology.org/J93-2004/)Cited by:[§6\.4\.1](https://arxiv.org/html/2605.26147#S6.SS4.SSS1.p3.1)\.
- \[56\]P\. Moehrke and Y\. Huang\(2024\)Bayesian neural networks in mortality modelling\.The Actuarial\.Cited by:[§2](https://arxiv.org/html/2605.26147#S2.SS0.SSS0.Px1.p1.1)\.
- \[57\]Y\. Netzer, T\. Wang, A\. Coates, A\. Bissacco, B\. Wu, and A\. Ng\(2011\)Reading digits in natural images with unsupervised feature learning\.External Links:[Link](http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf)Cited by:[§6\.2\.2](https://arxiv.org/html/2605.26147#S6.SS2.SSS2.Px5.p1.1)\.
- \[58\]M\. Oquab, T\. Darcet, T\. Moutakanni, H\. V\. Vo, M\. Szafraniec, V\. Khalidov, P\. Fernandez, D\. HAZIZA, F\. Massa, A\. El\-Nouby, M\. Assran, N\. Ballas, W\. Galuba, R\. Howes, P\. Huang, S\. Li, I\. Misra, M\. Rabbat, V\. Sharma, G\. Synnaeve, H\. Xu, H\. Jegou, J\. Mairal, P\. Labatut, A\. Joulin, and P\. Bojanowski\(2024\)DINOv2: learning robust visual features without supervision\.Transactions on Machine Learning Research\.Note:Featured CertificationExternal Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=a68SUt6zFt)Cited by:[§4\.9](https://arxiv.org/html/2605.26147#S4.SS9.SSS0.Px2.p1.3)\.
- \[59\]Y\. Ovadia, E\. Fertig, J\. Ren, Z\. Nado, D\. Sculley, S\. Nowozin, J\. V\. Dillon, B\. Lakshminarayanan, and J\. Snoek\(2019\)Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift\.InProceedings of the 33rd International Conference on Neural Information Processing Systems,Cited by:[§6\.4\.2](https://arxiv.org/html/2605.26147#S6.SS4.SSS2.Px3.p1.1)\.
- \[60\]R\. Pascanu, T\. Mikolov, and Y\. Bengio\(2013\)On the difficulty of training recurrent neural networks\.InProceedings of the 30th International Conference on International Conference on Machine Learning \- Volume 28,ICML’13,pp\. III–1310–III–1318\.Cited by:[§6\.5\.1](https://arxiv.org/html/2605.26147#S6.SS5.SSS1.Px3.p1.1)\.
- \[61\]A\. Paszke, S\. Gross, F\. Massa, A\. Lerer, J\. Bradbury, G\. Chanan, T\. Killeen, Z\. Lin, N\. Gimelshein, L\. Antiga, A\. Desmaison, A\. Köpf, E\. Yang, Z\. DeVito, M\. Raison, A\. Tejani, S\. Chilamkurthy, B\. Steiner, L\. Fang, J\. Bai, and S\. Chintala\(2019\)PyTorch: an imperative style, high\-performance deep learning library\.InProceedings of the 33rd International Conference on Neural Information Processing Systems,Cited by:[§4\.5](https://arxiv.org/html/2605.26147#S4.SS5.p3.4)\.
- \[62\]D\. A\. Pomerleau\(1988\)ALVINN: an autonomous land vehicle in a neural network\.InProceedings of the 2nd International Conference on Neural Information Processing Systems,NIPS’88,Cambridge, MA, USA,pp\. 305–313\.Cited by:[§I\.4](https://arxiv.org/html/2605.26147#A9.SS4.p1.1),[§6\.5\.1](https://arxiv.org/html/2605.26147#S6.SS5.SSS1.Px1.p1.7)\.
- \[63\]M\. L\. Puterman\(1994\)Markov decision processes: discrete stochastic dynamic programming\.Wiley Series in Probability and Statistics,John Wiley & Sons\.External Links:ISBN 978\-0\-471\-61977\-2,[Document](https://dx.doi.org/10.1002/9780470316887)Cited by:[§6\.5](https://arxiv.org/html/2605.26147#S6.SS5.p1.1),[§7\.4](https://arxiv.org/html/2605.26147#S7.SS4.p1.11)\.
- \[64\]A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever\(2021\)Learning transferable visual models from natural language supervision\.External Links:2103\.00020,[Link](https://arxiv.org/abs/2103.00020)Cited by:[§4\.9](https://arxiv.org/html/2605.26147#S4.SS9.SSS0.Px2.p1.3)\.
- \[65\]H\. Raiffa and R\. Schlaifer\(1961\)Applied statistical decision theory\.Harvard University Press,Boston\.Cited by:[§2](https://arxiv.org/html/2605.26147#S2.SS0.SSS0.Px1.p2.1)\.
- \[66\]M\. T\. Ribeiro, S\. Singh, and C\. Guestrin\(2016\)”Why should i trust you?”: explaining the predictions of any classifier\.InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,KDD ’16,New York, NY, USA,pp\. 1135–1144\.External Links:ISBN 9781450342322,[Link](https://doi.org/10.1145/2939672.2939778),[Document](https://dx.doi.org/10.1145/2939672.2939778)Cited by:[§6\.3\.2](https://arxiv.org/html/2605.26147#S6.SS3.SSS2.Px1.p1.1)\.
- \[67\]C\. Rudin\(2019\)Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead\.Nature Machine Intelligence1,pp\. 206–215\.External Links:[Document](https://dx.doi.org/10.1038/s42256-019-0048-x)Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p2.1)\.
- \[68\]P\. Sebastiani and H\. P\. Wynn\(2000\)Maximum entropy sampling and optimal bayesian experimental design\.Journal of the Royal Statistical Society\. Series B \(Statistical Methodology\)62\(1\),pp\. 145–157\.External Links:ISSN 13697412, 14679868,[Link](http://www.jstor.org/stable/2680683)Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p5.1),[§6\.6](https://arxiv.org/html/2605.26147#S6.SS6.p1.7),[§6\.6](https://arxiv.org/html/2605.26147#S6.SS6.p1.8)\.
- \[69\]M\. Sensoy, L\. Kaplan, and M\. Kandemir\(2018\)Evidential deep learning to quantify classification uncertainty\.InProceedings of the 32nd International Conference on Neural Information Processing Systems,NIPS’18,Red Hook, NY, USA,pp\. 3183–3193\.Cited by:[§A\.5](https://arxiv.org/html/2605.26147#A1.SS5.p1.1),[item 3](https://arxiv.org/html/2605.26147#A9.I6.i3.p1.1),[item 4](https://arxiv.org/html/2605.26147#S1.I1.i4.p1.1),[§2](https://arxiv.org/html/2605.26147#S2.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2605.26147#S3.SS0.SSS0.Px1.p1.2),[§3](https://arxiv.org/html/2605.26147#S3.SS0.SSS0.Px1.p3.6),[item 3](https://arxiv.org/html/2605.26147#S6.I10.i3.p1.1),[footnote 26](https://arxiv.org/html/2605.26147#footnote26)\.
- \[70\]N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. Le, G\. Hinton, and J\. Dean\(2017\)Outrageously large neural networks: the sparsely\-gated mixture\-of\-experts layer\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=B1ckMDqlg)Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p3.1),[§6\.4\.1](https://arxiv.org/html/2605.26147#S6.SS4.SSS1.p5.1),[§6\.5\.1](https://arxiv.org/html/2605.26147#S6.SS5.SSS1.Px3.p1.1),[§7\.3](https://arxiv.org/html/2605.26147#S7.SS3.SSS0.Px2.p1.1),[footnote 24](https://arxiv.org/html/2605.26147#footnote24)\.
- \[71\]R\. Shwartz\-Ziv and N\. Tishby\(2017\)Opening the black box of deep neural networks via information\.External Links:1703\.00810,[Link](https://arxiv.org/abs/1703.00810)Cited by:[§7\.1](https://arxiv.org/html/2605.26147#S7.SS1.p1.1)\.
- \[72\]R\. von Mises\(1936\)La distribution de la plus grande de n valeurs\.Revue de Mathématiques de l’Union Interbalkanique1,pp\. 141–160\.Cited by:[§B\.3](https://arxiv.org/html/2605.26147#A2.SS3.p1.1)\.
- \[73\]A\. Wald\(1945\)Sequential tests of statistical hypotheses\.The Annals of Mathematical Statistics16\(2\),pp\. 117–186\.External Links:ISSN 00034851,[Link](http://www.jstor.org/stable/2235829)Cited by:[§1](https://arxiv.org/html/2605.26147#S1.p1.1)\.
- \[74\]S\. R\. Waterhouse and A\. J\. Robinson\(1995\)Constructive algorithms for hierarchical mixtures of experts\.InProceedings of the 9th International Conference on Neural Information Processing Systems,NIPS’95,Cambridge, MA, USA,pp\. 584–590\.Cited by:[Appendix E](https://arxiv.org/html/2605.26147#A5.SS0.SSS0.Px3.p1.1),[Appendix E](https://arxiv.org/html/2605.26147#A5.SS0.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2605.26147#S2.SS0.SSS0.Px3.p1.1)\.
- \[75\]G\. K\. Zipf\(1949\)Human behavior and the principle of least effort: an introduction to human ecology\.Addison\-Wesley\.Cited by:[footnote 30](https://arxiv.org/html/2605.26147#footnote30)\.
## Appendix A The Dirichlet Distribution
This section provides a formal overview of the Dirichlet distribution, detailing the mathematical properties that make it the foundational engine for the uncertainty-tracking and evidence-accumulating mechanisms in our Bayesian routing framework.
### A.1 Definition and Support
The Dirichlet distribution is a family of continuous multivariate probability distributions parameterized by a vector of positive reals $\bm{\alpha}=(\alpha_{1},\alpha_{2},\dots,\alpha_{K})\in\mathbb{R}^{K}_{+}$. It is defined over the $(K-1)$-dimensional probability simplex, meaning its support consists of $K$-dimensional vectors $\mathbf{p}$ such that $p_{k}\in(0,1)$ and $\sum_{k=1}^{K}p_{k}=1$.
The probability density function (PDF) is given by:
$$f(\mathbf{p}\mid\bm{\alpha})=\frac{1}{\text{B}(\bm{\alpha})}\prod_{k=1}^{K}p_{k}^{\alpha_{k}-1}\tag{14}$$
where $\text{B}(\bm{\alpha})$ is the multivariate Beta function, expressed in terms of the Gamma function $\Gamma(\cdot)$ as:
$$\text{B}(\bm{\alpha})=\frac{\prod_{k=1}^{K}\Gamma(\alpha_{k})}{\Gamma\left(\sum_{k=1}^{K}\alpha_{k}\right)}\tag{15}$$
In our framework, the vector $\mathbf{p}$ represents the latent true probability distribution over the $K$ final outcomes, and $\bm{\alpha}_{t}$ is the concentration parameter, which represents the model's intermediate belief state at routing step $t$.
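The density in Eqs. 14 and 15 is usually evaluated in log space for numerical stability. The following is a minimal sketch using only the Python standard library; the function name `dirichlet_log_pdf` and the example inputs are illustrative assumptions, not part of the NBSR implementation.

```python
import math

def dirichlet_log_pdf(p, alpha):
    """Log-density of Dir(alpha) at a point p in the open simplex (Eqs. 14-15)."""
    if len(p) != len(alpha):
        raise ValueError("p and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("concentration parameters must be positive")
    if any(x <= 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must lie in the open probability simplex")
    # log B(alpha) = sum_k log Gamma(alpha_k) - log Gamma(sum_k alpha_k)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p)) - log_beta

# Uniform belief alpha = 1: the density is the constant Gamma(K) on the simplex.
uniform_density = math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))

# A belief concentrated on class 0 scores points near its mean more highly.
peaked_at_mean = dirichlet_log_pdf([0.8, 0.1, 0.1], [8.0, 1.0, 1.0])
peaked_off_mean = dirichlet_log_pdf([0.1, 0.8, 0.1], [8.0, 1.0, 1.0])
```

For $K=3$ the uniform case evaluates to $\Gamma(3)=2$, which is a quick sanity check on the normalizer.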
### A.2 Conjugacy to the Categorical Distribution
A defining property of the Dirichlet distribution is that it is the conjugate prior for the Categorical and Multinomial distributions. If the prior distribution over a discrete distribution $\mathbf{p}$ is $\text{Dir}(\bm{\alpha})$, and we observe a set of categorical occurrences represented by a count vector $\mathbf{c}$, the posterior distribution remains a Dirichlet distribution, updated simply by vector addition:
$$P(\mathbf{p}\mid\mathbf{c},\bm{\alpha})=\text{Dir}(\bm{\alpha}+\mathbf{c})\tag{16}$$
Our NBSR framework leverages this conjugacy to perform differentiable Bayesian updating. Instead of discrete counts, our neural experts extract continuous, strictly positive evidence vectors $\mathbf{e}_{t}$. We treat these continuous vectors as pseudo-counts, allowing the belief state to be updated deterministically at each node via $\bm{\alpha}_{t+1}=\bm{\alpha}_{t}+\mathbf{e}_{t}$.
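The update rule above can be sketched in a few lines. The evidence values below are made up for illustration, and `conjugate_update` is a hypothetical helper name; in NBSR the evidence vectors would come from the neural experts.

```python
def conjugate_update(alpha, evidence):
    """Return alpha_{t+1} = alpha_t + e_t for a strictly positive evidence vector e_t."""
    if len(alpha) != len(evidence):
        raise ValueError("alpha and evidence must have the same length")
    if any(e <= 0 for e in evidence):
        raise ValueError("evidence must be strictly positive")
    return [a + e for a, e in zip(alpha, evidence)]

# Start from the uniform root belief and apply two illustrative evidence vectors.
alpha = [1.0, 1.0, 1.0]
trajectory = [alpha]
for evidence in ([0.5, 2.0, 0.25], [0.25, 4.0, 0.5]):
    alpha = conjugate_update(alpha, evidence)
    trajectory.append(alpha)

# Total precision after each step; it can only grow when evidence is positive.
precisions = [sum(a) for a in trajectory]
```

Because the update is plain addition, the order in which evidence arrives does not change the final belief state, only the intermediate ones.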
### A.3 Expectation, Variance, and Covariance (The "Sharpening" Effect)
Let $\alpha_{0}=\sum_{k=1}^{K}\alpha_{k}$ be the sum of the concentration parameters, often referred to as the precision or inverse-variance parameter. The expected value of the $k$-th component of the distribution $\mathbf{p}$ equals its relative share of the total concentration:
$$\mathbb{E}[p_{k}]=\frac{\alpha_{k}}{\alpha_{k}+(\alpha_{0}-\alpha_{k})}=\frac{\alpha_{k}}{\alpha_{0}}\tag{17}$$
The variance of each component is given by:
$$\text{Var}[p_{k}]=\frac{\alpha_{k}(\alpha_{0}-\alpha_{k})}{\alpha_{0}^{2}(\alpha_{0}+1)}\tag{18}$$
Furthermore, for any two distinct components $i\neq j$, the covariance is:
$$\text{Cov}[p_{i},p_{j}]=\frac{-\alpha_{i}\alpha_{j}}{\alpha_{0}^{2}(\alpha_{0}+1)}\tag{19}$$
The negative sign in Eq. [19](https://arxiv.org/html/2605.26147#A1.E19) reflects the inherent competition between categories under the simplex constraint $\sum p_{k}=1$: if the probability of one class increases, the probabilities of the alternatives must decrease to maintain the sum. Consequently, the covariance matrix of a Dirichlet distribution is singular, reflecting the linear dependence between the components of $\mathbf{p}$.
The relationship between these moments and the total precision $\alpha_{0}$ is central to our NBSR framework. As the model traverses deeper into the decision graph and accumulates evidence $\mathbf{e}_{t}$, $\alpha_{0}$ strictly increases. Rewriting Eq. 18 as $\text{Var}[p_{k}]=\mathbb{E}[p_{k}](1-\mathbb{E}[p_{k}])/(\alpha_{0}+1)\leq 1/(4(\alpha_{0}+1))$ shows that the variance, and likewise the magnitude of the covariance, is bounded by $\mathcal{O}(1/\alpha_{0})$. This is the progressive "sharpening" of the belief state: the Dirichlet mass concentrates around its mean $\mathbb{E}[\mathbf{p}]$, and when the accumulated evidence favors one class, that mean is itself a high-confidence prediction.
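To make the sharpening effect concrete, the sketch below evaluates Eqs. 17 to 19 for two belief states with the same mean but different total precision. The helper names and the example vectors are assumptions chosen for illustration.

```python
def dirichlet_moments(alpha):
    """Return (alpha_0, means, variances) following Eqs. 17-18."""
    a0 = sum(alpha)
    means = [a / a0 for a in alpha]
    variances = [a * (a0 - a) / (a0 ** 2 * (a0 + 1.0)) for a in alpha]
    return a0, means, variances

def dirichlet_covariance(alpha, i, j):
    """Covariance of p_i and p_j for i != j (Eq. 19)."""
    a0 = sum(alpha)
    return -alpha[i] * alpha[j] / (a0 ** 2 * (a0 + 1.0))

early = [2.0, 1.0, 1.0]      # low precision, alpha_0 = 4
late = [20.0, 10.0, 10.0]    # same mean, alpha_0 = 40

a0_early, mean_early, var_early = dirichlet_moments(early)
a0_late, mean_late, var_late = dirichlet_moments(late)

# Upper bound on any marginal variance at the later precision: 1 / (4 (alpha_0 + 1)).
bound_late = 1.0 / (4.0 * (a0_late + 1.0))

# Each row of the covariance matrix sums to zero, which is why it is singular.
row_sum = (var_late[0]
           + dirichlet_covariance(late, 0, 1)
           + dirichlet_covariance(late, 0, 2))
```

The two states share the mean $(0.5, 0.25, 0.25)$, yet every marginal variance is smaller at the higher precision, which is the behavior the bound describes.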
### A.4 Marginal Distributions
The marginal distribution of each individual component $p_{k}$ of the Dirichlet-distributed vector $\mathbf{p}$ follows a Beta distribution. Using the total concentration parameter $\alpha_{0}$, this is:
$$p_{k}\sim\text{Beta}(\alpha_{k},\alpha_{0}-\alpha_{k})\tag{20}$$
Isolating the marginal distribution is useful during inference, particularly in complex domains such as medical diagnosis. It allows the system to compute credible intervals for a specific target outcome $k$ (e.g. a specific disease) from the quantiles of a single Beta distribution, without reference to how the remaining mass is split among the other $K-1$ classes.
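As a rough illustration of extracting an interval for one class, the sketch below estimates an equal-tailed credible interval from the Beta marginal in Eq. 20 by Monte Carlo, using only the standard library. The function name, sample size, and seed are assumptions; a library with a Beta quantile function (for example `scipy.stats.beta.ppf`) would give the interval directly.

```python
import random

def marginal_credible_interval(alpha, k, level=0.95, n_samples=20_000, seed=0):
    """Monte Carlo equal-tailed interval for p_k ~ Beta(alpha_k, alpha_0 - alpha_k)."""
    rng = random.Random(seed)
    a = alpha[k]
    b = sum(alpha) - alpha[k]
    # Sorted draws let us read off empirical quantiles by index.
    draws = sorted(rng.betavariate(a, b) for _ in range(n_samples))
    tail = (1.0 - level) / 2.0
    lower = draws[int(tail * n_samples)]
    upper = draws[min(n_samples - 1, int((1.0 - tail) * n_samples))]
    return lower, upper

alpha = [8.0, 1.0, 1.0]                      # illustrative belief favoring class 0
lower, upper = marginal_credible_interval(alpha, k=0)
posterior_mean = alpha[0] / sum(alpha)       # Eq. 17
```

Only $\alpha_{k}$ and $\alpha_{0}$ enter the computation, so the interval for class $k$ is unaffected by how the remaining concentration is distributed.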
### A.5 Epistemic Uncertainty and Subjective Logic
Rooted in the framework of Subjective Logic \[[43](https://arxiv.org/html/2605.26147#bib.bib73)\], the Dirichlet distribution provides a native quantification of confidence that explicitly separates evidence-based certainty from epistemic ignorance. This mathematical bridge was the cornerstone formulation that introduced Evidential Deep Learning to the neural network community \[[69](https://arxiv.org/html/2605.26147#bib.bib28)\].
In Subjective Logic, a belief state over $K$ mutually exclusive classes is modeled by assigning a belief mass $b_{k}\geq 0$ to each class alongside an overall epistemic uncertainty mass $u\geq 0$. These components are constrained to sum to unity:
$$u+\sum_{k=1}^{K}b_{k}=1\tag{21}$$
To map this theoretical belief state to a tractable probabilistic framework, it is tied to the Dirichlet distribution. The concentration parameters $\alpha_{k}$ are defined as a combination of the accumulated evidence $e_{k}\geq 0$ and a non-informative prior weight $W$:
$$\alpha_{k}=e_{k}+Wa_{k}\tag{22}$$
where $a_{k}$ is the base rate (prior probability) of class $k$. Assuming a uniform prior across the $K$ classes, we set $a_{k}=1/K$ and the prior weight to exactly $W=K$. This simplifies the concentration parameter to $\alpha_{k}=e_{k}+1$.
Consequently, the total precision $\alpha_{0}$ (the sum of all concentration parameters) expands to:
$$\alpha_{0}=\sum_{k=1}^{K}\alpha_{k}=\sum_{k=1}^{K}(e_{k}+1)=\sum_{k=1}^{K}e_{k}+K\tag{23}$$
Under this exact mapping, the belief mass for a specific class and the total epistemic uncertainty are defined inversely to the total precision:
$$b_{k}=\frac{e_{k}}{\alpha_{0}}\quad\text{and}\quad u=\frac{K}{\alpha_{0}}\tag{24}$$
This satisfies the foundational Subjective Logic constraint: $\sum b_{k}+u=\frac{\sum e_{k}+K}{\alpha_{0}}=\frac{\alpha_{0}}{\alpha_{0}}=1$.
In our NBSR framework, at the root node, the uniform prior $\bm{\alpha}_{0}=\mathbf{1}$ yields a total precision of $\alpha_{0}=K$. Because no evidence has been extracted yet ($\sum e_{k}=0$), this results in maximum epistemic uncertainty ($u=1.0$). As the model routes through the graph and accumulates strictly positive evidence vectors ($\mathbf{e}_{t}>0$), $\alpha_{0}$ monotonically increases (Theorem [1](https://arxiv.org/html/2605.26147#Thmtheorem1)). Consequently, $u=K/\alpha_{0}$ strictly decreases along the trajectory and approaches zero as evidence accumulates. This explicit metric allows the NBSR framework to natively trigger Out-Of-Distribution (OOD) safety abstentions simply by monitoring whether $u$ remains high during inference.
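The mapping in Eqs. 22 to 24 and the abstention check can be sketched as follows. The threshold `u_max` and both function names are hypothetical; the text does not specify a particular abstention threshold here.

```python
def subjective_opinion(evidence):
    """Map accumulated evidence e_k to belief masses b_k and uncertainty u (Eqs. 23-24)."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha0 = sum(evidence) + num_classes      # Eq. 23 with alpha_k = e_k + 1
    beliefs = [e / alpha0 for e in evidence]
    uncertainty = num_classes / alpha0
    return beliefs, uncertainty

def should_abstain(evidence, u_max=0.5):
    """Flag an input as possibly OOD when epistemic uncertainty stays above u_max."""
    _, uncertainty = subjective_opinion(evidence)
    return uncertainty > u_max

root_beliefs, root_u = subjective_opinion([0.0, 0.0, 0.0])   # no evidence yet
beliefs, u = subjective_opinion([9.0, 2.0, 1.0])             # illustrative evidence
```

At the root the uncertainty mass is exactly 1, and with the illustrative evidence above it drops to $3/15=0.2$, so the check no longer triggers.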
### A.6 Differential Entropy and Uncertainty Reduction
To explicitly optimize the router for efficient decision-making, we apply an intermediate penalty based on the differential entropy of the Dirichlet distribution. The differential entropy $\mathcal{H}(\bm{\alpha})$ measures the uncertainty of the belief state and is computed analytically as:
$$\mathcal{H}(\bm{\alpha})=\log\text{B}(\bm{\alpha})+(\alpha_{0}-K)\psi(\alpha_{0})-\sum_{k=1}^{K}(\alpha_{k}-1)\psi(\alpha_{k})\tag{25}$$
where $\psi(\cdot)$ denotes the digamma function (the logarithmic derivative of the Gamma function).
In our NBSR framework, at the root node ($t=0$), we initialize the state to $\bm{\alpha}_{0}=\mathbf{1}$, which maximizes this entropy equation, representing total structural ignorance. By penalizing $\mathcal{H}(\bm{\alpha}_{t})$ during training, we force the routing agents to select paths that rapidly maximize information gain. During inference, this precise entropy measurement serves as the gating mechanism for our dynamic early-exiting policy; routing is halted the moment $\mathcal{H}(\bm{\alpha}_{t})$ falls below the confidence threshold $\eta$.
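The entropy in Eq. 25 and the early-exit test can be sketched with the standard library alone. Since Python's `math` module has no digamma function, the sketch approximates it with the usual recurrence plus asymptotic series; the threshold value `eta=-2.0` and the function names are illustrative assumptions.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0 via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 6.0:                 # shift x upward until the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result

def dirichlet_entropy(alpha):
    """Differential entropy of Dir(alpha), following Eq. 25."""
    num_classes = len(alpha)
    a0 = sum(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(a0)
    return (log_beta
            + (a0 - num_classes) * digamma(a0)
            - sum((a - 1.0) * digamma(a) for a in alpha))

def should_exit(alpha, eta):
    """Early-exit rule: stop routing once the entropy falls below the threshold eta."""
    return dirichlet_entropy(alpha) < eta

h_root = dirichlet_entropy([1.0, 1.0, 1.0])    # uniform root belief
h_sharp = dirichlet_entropy([10.0, 3.0, 2.0])  # belief after illustrative evidence
```

Note that differential entropy is negative here even at its maximum: for $K=3$ the root value is $\log\text{B}(\mathbf{1})=-\log 2$, so a usable threshold $\eta$ is typically a negative number as well.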
### A.7 Kullback-Leibler Divergence between Two Dirichlet Distributions
The Kullback-Leibler (KL) divergence measures the relative entropy or information lost when one Dirichlet distribution, $\text{Dir}(\bm{\beta})$, is used to approximate another, $\text{Dir}(\bm{\alpha})$, over the same $(K-1)$-dimensional simplex. It is calculated as:
$$D_{\text{KL}}(\text{Dir}(\bm{\alpha})\parallel\text{Dir}(\bm{\beta}))=\log\frac{\Gamma\left(\sum_{k=1}^{K}\alpha_{k}\right)}{\Gamma\left(\sum_{k=1}^{K}\beta_{k}\right)}+\sum_{k=1}^{K}\left[\log\frac{\Gamma(\beta_{k})}{\Gamma(\alpha_{k})}+(\alpha_{k}-\beta_{k})\left(\psi(\alpha_{k})-\psi\left(\sum_{j=1}^{K}\alpha_{j}\right)\right)\right]\tag{26}$$
In our NBSR framework, the KL divergence provides a mathematically rigorous metric to quantify the exact informational impact of a single routing decision. By calculating $D_{\text{KL}}(\text{Dir}(\bm{\alpha}_{t+1})\parallel\text{Dir}(\bm{\alpha}_{t}))$, we can measure the precise magnitude of the belief shift caused by the newly extracted evidence $\mathbf{e}_{t}$.
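As a worked special case (a short derivation from Eqs. 25 and 26, not a result stated elsewhere in the text), take the reference distribution to be the uniform root belief $\text{Dir}(\mathbf{1})$, so that $\Gamma(\beta_{k})=1$ and $\sum_{k}\beta_{k}=K$. Writing $\alpha_{0}=\sum_{k}\alpha_{k}$ for the total precision of the updated belief:

```latex
\begin{aligned}
D_{\text{KL}}(\text{Dir}(\bm{\alpha})\parallel\text{Dir}(\mathbf{1}))
&=\log\frac{\Gamma(\alpha_{0})}{\Gamma(K)}
  +\sum_{k=1}^{K}\left[-\log\Gamma(\alpha_{k})
  +(\alpha_{k}-1)\left(\psi(\alpha_{k})-\psi(\alpha_{0})\right)\right]\\
&=-\log\text{B}(\bm{\alpha})-(\alpha_{0}-K)\psi(\alpha_{0})
  +\sum_{k=1}^{K}(\alpha_{k}-1)\psi(\alpha_{k})-\log\Gamma(K)\\
&=\mathcal{H}(\mathbf{1})-\mathcal{H}(\bm{\alpha}),
  \qquad\text{since }\mathcal{H}(\mathbf{1})=\log\text{B}(\mathbf{1})=-\log\Gamma(K).
\end{aligned}
```

For the first routing step out of the root, the belief shift measured by the KL divergence therefore coincides with the drop in differential entropy. For later steps, where the reference belief is no longer uniform, the two quantities generally differ.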
## Appendix B The Gumbel Distribution
This section provides a formal overview of the Gumbel distribution, detailing its statistical properties and its critical role in the Gumbel-Max trick, which forms the theoretical foundation for discrete stochastic routing.
### B.1 Definition and Support
The Gumbel distribution (specifically the Type I Extreme Value Distribution) is a continuous probability distribution used to model the distribution of the maximum (or minimum) of a number of samples of various distributions. It has continuous support over the entire real line, $x\in(-\infty,\infty)$. It is parameterized by a location parameter $\mu\in\mathbb{R}$ and a scale parameter $\beta>0$.
The probability density function (PDF) of the Gumbel distribution is defined as:
$$f(x;\mu,\beta)=\frac{1}{\beta}\exp\left(-\frac{x-\mu}{\beta}-\exp\left(-\frac{x-\mu}{\beta}\right)\right)\tag{27}$$
The cumulative distribution function (CDF) is given by:
$$F(x;\mu,\beta)=\exp\left(-\exp\left(-\frac{x-\mu}{\beta}\right)\right)\tag{28}$$
In the context of deep learning and our routing framework, we predominantly utilize the Standard Gumbel distribution, where $\mu=0$ and $\beta=1$. The CDF thus simplifies to $F(x)=\exp(-\exp(-x))$.
### B.2 Mean and Variance
The expected value (mean) of a Gumbel-distributed random variable $X\sim\text{Gumbel}(\mu,\beta)$ is:
$$\mathbb{E}[X]=\mu+\beta\gamma\tag{29}$$
where $\gamma\approx 0.5772$ is the Euler-Mascheroni constant.
The variance is purely a function of the scale parameter:
$$\text{Var}[X]=\frac{\pi^{2}}{6}\beta^{2}\tag{30}$$
For the Standard Gumbel distribution, the mean is approximately $0.5772$ and the variance is $\frac{\pi^{2}}{6}\approx 1.645$.
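Gumbel noise is typically generated by inverting the CDF in Eq. 28: solving $u=F(x)$ for $x$ gives $x=\mu-\beta\log(-\log u)$ with $u$ uniform on $(0,1)$. The sketch below does this with the standard library and checks the empirical moments against Eqs. 29 and 30; the function name, seed, and sample count are arbitrary choices.

```python
import math
import random

def sample_gumbel(rng, mu=0.0, beta=1.0):
    """Draw one Gumbel(mu, beta) sample by inverting the CDF of Eq. 28."""
    u = rng.random()
    while u == 0.0:               # rng.random() can return 0.0; avoid log(0)
        u = rng.random()
    return mu - beta * math.log(-math.log(u))

rng = random.Random(0)
draws = [sample_gumbel(rng) for _ in range(100_000)]

# Empirical moments, to compare against gamma ~ 0.5772 and pi^2 / 6 ~ 1.645.
sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / (len(draws) - 1)
```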
### B.3 Extreme Value Theory and the Gumbel-Max Trick
According to the Fisher-Tippett-Gnedenko theorem of extreme value theory \[[19](https://arxiv.org/html/2605.26147#bib.bib3),[17](https://arxiv.org/html/2605.26147#bib.bib4),[72](https://arxiv.org/html/2605.26147#bib.bib5),[14](https://arxiv.org/html/2605.26147#bib.bib6),[23](https://arxiv.org/html/2605.26147#bib.bib7)\], the Gumbel distribution is one of the three possible limiting distributions for the maximum of a sequence of independent and identically distributed (i.i.d.) random variables; the other two are the Fréchet distribution and the Weibull distribution.
This connection to maxima underlies the Gumbel-Max trick \[[54](https://arxiv.org/html/2605.26147#bib.bib8),[40](https://arxiv.org/html/2605.26147#bib.bib9),[53](https://arxiv.org/html/2605.26147#bib.bib10)\], an exact method for drawing discrete samples from a Categorical distribution. Let $\mathbf{p}=(p_{1},p_{2},\dots,p_{K})$ be the probabilities of a Categorical distribution, and let $g_{1},g_{2},\dots,g_{K}$ be i.i.d. samples drawn from a Standard Gumbel distribution. We can draw a discrete categorical sample $z$ (where $z\in\{1,\dots,K\}$) by applying the argmax operator to the perturbed log-probabilities:
$$z=\arg\max_{k\in\{1,\dots,K\}}(\log p_{k}+g_{k})\tag{31}$$
The resulting index is distributed exactly according to $\mathbf{p}$, i.e. $P(z=k)=p_{k}$.
In our NBSR framework, the router at node $v_{t}$ generates logits $\mathbf{z}_{t}$. By adding Standard Gumbel noise to these logits before applying the argmax operation, we instantiate a stochastic routing policy that selects each outgoing path with probability given by the softmax of the logits, enabling exploration during training.
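The claim that Eq. 31 samples index $k$ with probability $p_{k}$ can be checked empirically. The following sketch uses an arbitrary three-class distribution and hypothetical helper names; it is an illustration of the trick, not the NBSR router itself.

```python
import math
import random

def sample_standard_gumbel(rng):
    """Standard Gumbel sample via inverse-CDF: -log(-log(u))."""
    u = rng.random()
    while u == 0.0:               # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_max_sample(log_probs, rng):
    """Return the argmax index of log p_k + g_k (Eq. 31), using 0-based indices."""
    perturbed = [lp + sample_standard_gumbel(rng) for lp in log_probs]
    return max(range(len(perturbed)), key=perturbed.__getitem__)

probs = [0.6, 0.3, 0.1]           # illustrative Categorical distribution
log_probs = [math.log(p) for p in probs]

rng = random.Random(0)
n_draws = 50_000
counts = [0] * len(probs)
for _ in range(n_draws):
    counts[gumbel_max_sample(log_probs, rng)] += 1

# Empirical selection frequencies; these should be close to probs.
frequencies = [c / n_draws for c in counts]
```

The same function works on unnormalized logits, because adding a constant to every entry does not change the argmax.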
## Appendix C The Gumbel-Softmax Continuous Relaxation
While the Gumbel-Max trick perfectly models discrete stochastic sampling, the argmax operation is fundamentally non-differentiable. Its derivative is zero almost everywhere, which breaks the gradient flow required for backpropagation in deep neural networks. To enable end-to-end training of our hierarchical decision graph, we employ the Gumbel-Softmax continuous relaxation \[[40](https://arxiv.org/html/2605.26147#bib.bib9),[53](https://arxiv.org/html/2605.26147#bib.bib10)\].
### C.1 Continuous Approximation of Argmax
The core insight of the Gumbel-Softmax estimator is to replace the hard, non-differentiable argmax operator with the smooth, differentiable softmax function.
Given unnormalized log-probabilities (logits) $\mathbf{h}=(h_{1},\dots,h_{K})$ and i.i.d. Standard Gumbel noise $\mathbf{g}=(g_{1},\dots,g_{K})$, the Gumbel-Softmax estimator produces a continuous $K$-dimensional routing vector $\bm{\pi}$, where the $k$-th element is defined as:
$$\pi_{k}=\frac{\exp((h_{k}+g_{k})/\tau)}{\sum_{i=1}^{K}\exp((h_{i}+g_{i})/\tau)}\tag{32}$$
Here, $\tau>0$ is the temperature hyperparameter. Note that we apply this directly to the logits $\mathbf{h}$ generated by the router network at each node.
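Eq. 32 can be sketched in plain Python as below. The noise is passed in explicitly so that the relaxed vector can be compared with the Gumbel-Max choice under the same draw; the logits, temperature, and function names are illustrative assumptions.

```python
import math
import random

def draw_gumbel_noise(size, rng):
    """Draw `size` i.i.d. Standard Gumbel samples via inverse-CDF."""
    noise = []
    while len(noise) < size:
        u = rng.random()
        if u > 0.0:               # skip u == 0 to avoid log(0)
            noise.append(-math.log(-math.log(u)))
    return noise

def gumbel_softmax(logits, noise, tau):
    """Relaxed routing vector pi from Eq. 32 for logits h, noise g, temperature tau."""
    if tau <= 0:
        raise ValueError("tau must be strictly positive")
    scores = [(h + g) / tau for h, g in zip(logits, noise)]
    top = max(scores)             # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
noise = draw_gumbel_noise(len(logits), rng)
pi = gumbel_softmax(logits, noise, tau=0.5)
perturbed = [h + g for h, g in zip(logits, noise)]
```

For any $\tau>0$ the largest component of `pi` sits at the same index that the Gumbel-Max trick would select from `perturbed`, since dividing by $\tau$ and applying the softmax both preserve ordering.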
### C.2 Temperature Annealing
The temperature $\tau$ acts as a dial controlling the trade-off between the smoothness of the gradients and the discreteness of the approximation:
- High temperature ($\tau\to\infty$): the routing vector $\bm{\pi}$ approaches the uniform vector $(1/K,\dots,1/K)$ regardless of the logits, yielding highly stable but uninformative gradients.
- Low temperature ($\tau\to 0$): the softmax function sharpens, and the continuous vector $\bm{\pi}$ asymptotically approaches the discrete one-hot vector produced by the Gumbel-Max trick, at the cost of higher-variance gradients.
During training, it is standard practice to anneal $\tau$ from a high initial value to a small, strictly positive value. This allows the routing agents to explore the graph smoothly in the early epochs before committing to hard, highly confident discrete paths as training converges.
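The effect of annealing can be illustrated with a simple schedule. The exponential decay below, together with its constants `tau_start`, `tau_min`, and `decay`, is a hypothetical choice for illustration; the text does not prescribe a specific schedule at this point.

```python
import math

def anneal_temperature(step, tau_start=5.0, tau_min=0.1, decay=0.9):
    """Exponentially decay tau from tau_start, clipped below at tau_min (assumed schedule)."""
    return max(tau_min, tau_start * decay ** step)

def tempered_softmax(scores, tau):
    """Softmax of scores / tau, computed stably."""
    scaled = [s / tau for s in scores]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Fixed perturbed logits h_k + g_k, held constant to isolate the effect of tau.
perturbed = [1.2, 0.3, -0.5]

taus = [anneal_temperature(step) for step in range(0, 60, 10)]
# Largest component of pi at each temperature: near 1/K when hot, near 1 when cold.
peaks = [max(tempered_softmax(perturbed, tau)) for tau in taus]
```

With the noise held fixed, the largest component of $\bm{\pi}$ rises from roughly uniform at $\tau=5$ to nearly one-hot at $\tau=0.1$, which is the transition from exploration to commitment described above.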
### C.3 The Straight-Through Estimator (STE)
While the standard Gumbel-Softmax function produces a continuous “soft” routing path (e.g. sending $80\%$ of the signal left and $20\%$ right), our framework explicitly requires hard conditional execution to maximize computational efficiency (i.e. routing $100\%$ left and computing zero FLOPs on the right).
To achieve this, we utilize the Straight-Through Estimator (STE) [[3](https://arxiv.org/html/2605.26147#bib.bib11), [40](https://arxiv.org/html/2605.26147#bib.bib9)] variant of the Gumbel-Softmax trick. An STE is a technique for training networks that contain non-differentiable operations, such as quantization or binarization, whose gradients are zero or undefined. It applies the discrete, non-differentiable function in the forward pass but treats it as a continuous identity function (or a similar mapping) in the backward pass when calculating gradients. During the forward pass, we discretize the continuous output $\bm{\pi}$ using a hard argmax to produce a true one-hot routing vector $\bm{\pi}_{\text{hard}}$. This ensures that only a single active path is traversed, and unvisited branches consume zero computational resources.
During the backward pass, we bypass the non-differentiable argmax step and route the gradients directly through the continuous Gumbel-Softmax output $\bm{\pi}$. In modern autograd engines, this is implemented via the following detach operation:
$$\bm{\pi}_{\text{out}} = (\bm{\pi}_{\text{hard}} - \bm{\pi})\text{.detach()} + \bm{\pi} \tag{33}$$
This formulation guarantees that the forward pass computation exactly mirrors the discrete logical path of the Bayesian tree, while the backward pass receives dense, well-behaved gradients to update the routing parameters $\theta_{v_t}$ and the global feature extractor. The STE essentially creates a “dual-path” computational graph: in the forward pass the operation behaves like a step function (discrete), but in the backward pass it behaves like an identity function or a smooth sigmoid (continuous), allowing the error signal to reach the weights of the router. This is what allows our Bayesian DAG to maintain the efficiency of a hard decision tree while still being trainable via standard backpropagation.
The STE also cleanly decouples the gradient flow between the specialized experts and the routing gates. Because unselected experts are multiplied by exactly zero during the forward pass (or bypassed entirely via sparse conditional logic), they do not contribute to the final loss; consequently, they receive zero gradients, preserving their highly specialized weights and preventing representation collapse. Conversely, because the STE routes the backward signal through the continuous distribution $\bm{\pi}$, the loss gradient flows back to all branches of the router’s logits. This allows the routing policy to evaluate the counterfactual probabilities of unselected branches and update its distribution accordingly, enabling the network to learn optimal pruning strategies without ever incurring the forward or backward computational cost of executing the unselected experts.
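The forward and backward behavior of Eq. 33 can be reproduced by hand in a small sketch. It assumes the Gumbel noise has already been added to the router logits, and it replaces the autograd engine with the analytic softmax Jacobian, which is the gradient an autograd engine would compute once the detached term is treated as a constant. The function name `ste_forward_backward` and its arguments are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def ste_forward_backward(perturbed_logits, tau, upstream_grad):
    """Forward: one-hot argmax. Backward: gradient of the soft sample pi.

    perturbed_logits: router logits with Gumbel noise already added.
    upstream_grad: dLoss/dpi_out, one entry per branch.
    """
    pi = softmax([x / tau for x in perturbed_logits])
    winner = max(range(len(pi)), key=pi.__getitem__)
    pi_hard = [1.0 if i == winner else 0.0 for i in range(len(pi))]
    # Value of Eq. 33: (pi_hard - pi) is a constant, so the forward output
    # equals pi_hard up to floating-point rounding.
    pi_out = [(h - p) + p for h, p in zip(pi_hard, pi)]
    # The detached term has zero gradient, so d pi_out / d logits equals
    # d pi / d logits: the softmax Jacobian scaled by 1 / tau.
    dot = sum(u * p for u, p in zip(upstream_grad, pi))
    grad_logits = [p * (u - dot) / tau for p, u in zip(pi, upstream_grad)]
    return pi_out, grad_logits


pi_out, grad = ste_forward_backward([2.0, 1.0, 0.0], tau=1.0, upstream_grad=[1.0, 0.0, 0.0])
print(pi_out)  # hard routing decision used in the forward pass
print(grad)    # every logit receives a gradient, including unselected branches
```

Only the first branch is executed in the forward pass, yet all three logits receive a non-zero gradient, which is the counterfactual signal described above.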
## Appendix D Derivation of the Generalized Bias-Variance Decomposition
In this section, we detail the derivation of the bias-variance decomposition for the NLL risk, i.e. Eq. [10](https://arxiv.org/html/2605.26147#S5.E10) in Section [5.4](https://arxiv.org/html/2605.26147#S5.SS4).
Let $x$ be a given input and $y$ be the corresponding target. We denote the true conditional data distribution as $P^{*}(y|x)$. The predictive distribution produced by our neural routing graph, trained on a specific finite dataset $\mathcal{D}$, is denoted as $P_{\mathcal{G}}(y|x;\mathcal{D})$.
The expected NLL risk for a specific input $x$, evaluated over all possible training datasets $\mathcal{D}$ and all possible targets drawn from the true distribution, is defined as:
$$\mathcal{R}(x) = \mathbb{E}_{\mathcal{D}}\Big[\mathbb{E}_{y\sim P^{*}(\cdot|x)}\big[-\log P_{\mathcal{G}}(y|x;\mathcal{D})\big]\Big] \tag{34}$$
To analyze the sources of error, we introduce the expected predictive distribution $\bar{P}_{\mathcal{G}}(y|x)$, which represents the model’s average prediction over all possible datasets. We can inject the true distribution $P^{*}(y|x)$ and the expected model distribution $\bar{P}_{\mathcal{G}}(y|x)$ into the logarithm via addition and subtraction:
$$\begin{aligned} -\log P_{\mathcal{G}}(y|x;\mathcal{D}) &= -\log P^{*}(y|x) \\ &\quad + \big(\log P^{*}(y|x) - \log\bar{P}_{\mathcal{G}}(y|x)\big) \\ &\quad + \big(\log\bar{P}_{\mathcal{G}}(y|x) - \log P_{\mathcal{G}}(y|x;\mathcal{D})\big) \end{aligned} \tag{35}$$
We now substitute Eq. [35](https://arxiv.org/html/2605.26147#A4.E35) back into the inner expectation over $y\sim P^{*}(\cdot|x)$ from Eq. [34](https://arxiv.org/html/2605.26147#A4.E34). Because the expectation is a linear operator, we can evaluate each of the three terms separately:
1\. The Irreducible Noise: the expectation of the first term depends entirely on the true distribution and represents the inherent uncertainty in the data generation process (the Shannon entropy):
$$\mathbb{E}_{y\sim P^{*}}\big[-\log P^{*}(y|x)\big] = \mathcal{H}(P^{*}) \tag{36}$$
2\. The Bias: the expectation of the second term measures the discrepancy between the true distribution and the model’s average prediction. This is precisely the Kullback-Leibler (KL) divergence:
$$\mathbb{E}_{y\sim P^{*}}\big[\log P^{*}(y|x) - \log\bar{P}_{\mathcal{G}}(y|x)\big] = D_{\text{KL}}\big(P^{*}\,\|\,\bar{P}_{\mathcal{G}}\big) \tag{37}$$
3\. The Variance: the expectation of the third term captures how much the predictions from models trained on specific datasets fluctuate around the average model prediction:
$$\mathbb{E}_{y\sim P^{*}}\big[\log\bar{P}_{\mathcal{G}}(y|x) - \log P_{\mathcal{G}}(y|x;\mathcal{D})\big] = \sum_{y} P^{*}(y|x)\log\frac{\bar{P}_{\mathcal{G}}(y|x)}{P_{\mathcal{G}}(y|x;\mathcal{D})} \tag{38}$$
In standard generalized bias-variance decompositions for likelihood estimators, it is common practice to approximate this final term by replacing the true distribution $P^{*}(y|x)$ with the expected model distribution $\bar{P}_{\mathcal{G}}(y|x)$. This substitution decouples the model’s internal variance from the external ground-truth distribution, yielding a pure measure of model instability evaluated via KL divergence:
$$\approx \sum_{y}\bar{P}_{\mathcal{G}}(y|x)\log\frac{\bar{P}_{\mathcal{G}}(y|x)}{P_{\mathcal{G}}(y|x;\mathcal{D})} = D_{\text{KL}}\big(\bar{P}_{\mathcal{G}}\,\|\,P_{\mathcal{G}}(\cdot;\mathcal{D})\big) \tag{39}$$
Final Decomposition: applying the outer expectation over all datasets $\mathbb{E}_{\mathcal{D}}[\cdot]$ to the aggregated terms yields the final formal decomposition of the expected risk:
$$\mathbb{E}_{\mathcal{D}}\Big[\mathbb{E}_{y\sim P^{*}(\cdot|x)}\big[-\log P_{\mathcal{G}}(y|x;\mathcal{D})\big]\Big] \approx \underbrace{\mathcal{H}(P^{*})}_{\text{Irreducible Noise}} + \underbrace{D_{\text{KL}}\big(P^{*}\,\|\,\bar{P}_{\mathcal{G}}\big)}_{\text{Bias}} + \underbrace{\mathbb{E}_{\mathcal{D}}\big[D_{\text{KL}}\big(\bar{P}_{\mathcal{G}}\,\|\,P_{\mathcal{G}}(\cdot;\mathcal{D})\big)\big]}_{\text{Variance}} \tag{40}$$
This derivation explicitly maps the negative log-likelihood objective of the routing graph to the structural constraints ($L$ and $\bar{W}$) discussed in Section [5.4](https://arxiv.org/html/2605.26147#S5.SS4).
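The decomposition can be checked numerically on a toy example. The sketch below assumes a three-class problem, replaces the expectation over datasets with an average over three hypothetical predictive distributions, and takes $\bar{P}_{\mathcal{G}}$ to be their arithmetic mean. It evaluates the exact third term of Eq. 38 alongside the KL approximation of Eq. 39, so the size of the approximation gap can be inspected directly.

```python
import math


def cross_entropy(p, q):
    """E_{y~p}[-log q(y)] for discrete distributions given as lists."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def decompose(p_true, models):
    """Return the terms of Eqs. 34-40 for a single input x.

    models: one predictive distribution per (hypothetical) training set.
    """
    n = len(models)
    p_bar = [sum(col) / n for col in zip(*models)]  # arithmetic-mean prediction
    risk = sum(cross_entropy(p_true, q) for q in models) / n  # Eq. 34
    noise = cross_entropy(p_true, p_true)  # Eq. 36, Shannon entropy
    bias = kl(p_true, p_bar)  # Eq. 37
    # Eq. 38 averaged over datasets: the exact third term, weighted by P*.
    third_exact = sum(
        cross_entropy(p_true, q) - cross_entropy(p_true, p_bar) for q in models
    ) / n
    variance_approx = sum(kl(p_bar, q) for q in models) / n  # Eq. 39
    return {"risk": risk, "noise": noise, "bias": bias,
            "third_exact": third_exact, "variance_approx": variance_approx}


terms = decompose([0.7, 0.2, 0.1],
                  [[0.6, 0.3, 0.1], [0.8, 0.1, 0.1], [0.5, 0.3, 0.2]])
for name, value in terms.items():
    print(f"{name}: {value:.4f}")
```

With the exact third term the three components sum to the risk up to rounding error; the "≈" in Eq. 40 arises only from swapping `third_exact` for `variance_approx`.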
## Appendix E Difference Between NBSR and Classical Decision Trees
A fundamental distinction must be drawn between our NBSR framework and classical hierarchical models, such as Classification and Regression Trees \(CART\)\[[7](https://arxiv.org/html/2605.26147#bib.bib77)\], Soft Decision Trees \(SDTs\) or hierarchical Mixtures of Experts \(MoEs\)\[[42](https://arxiv.org/html/2605.26147#bib.bib78)\]\. The divergence lies in four critical architectural paradigms: the functional role of intermediate nodes, the mathematical viability of soft routing, the semantic nature of the input space, and the fundamental mechanism of information aggregation\.
##### 1\. The Role of Intermediate Nodes: Routing Gates vs\. Sequential Evidence Extractors
In classical differentiable decision trees and hierarchical MoEs, the graph is composed of two strictly distinct types of nodes:
- Internal Nodes (Routers): these nodes do not produce any classification output. They act purely as routing gates that calculate transition probabilities (e.g. $0.7$ probability of routing left, $0.3$ right).
- Terminal Leaves (Experts): these are the only nodes authorized to make a decision or output a class prediction.
In contrast, NBSR abandons this dichotomy. In our framework, each NBSR tree node, whether internal or terminal, consists of both a router and an evidence extractor (see Diagram [1](https://arxiv.org/html/2605.26147#S4.F1)). Consequently, intermediate nodes act as both routers and experts. When a sample arrives at an intermediate node (e.g. the Depth 1 `animal_mid` node in the CIFAR-10 classification task), the model does not merely calculate the probability of traversing to the Pets or Wildlife leaves. Instead, it immediately queries the global knowledge oracle $\mathbf{h}_x$ to extract a full $K$-dimensional evidence vector (Eq. [5](https://arxiv.org/html/2605.26147#S4.E5)):
$$\mathbf{e}_{\text{mid}} = \text{Activation}(\mathbf{W}_{\text{mid}}\mathbf{h}_x + \mathbf{b}_{\text{mid}})$$
This evidence is directly injected into the Dirichlet prior ($\bm{\alpha}_1 = \bm{\alpha}_0 + \mathbf{e}_{\text{mid}}$). By updating the belief state at the intermediate nodes before passing the computational flow deeper into the tree, the NBSR framework allows the decision to sequentially sharpen. This provides the mathematical foundation for dynamic early exiting, as the model can evaluate confidence at $t=1$ without needing to reach a terminal leaf.
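The intermediate-node update can be sketched in a few lines. The sketch assumes Softplus as the activation (the choice reported for the experts in Appendix F), a uniform prior of ones, and toy values for the weights and the oracle vector; all of these are illustrative rather than the trained model.

```python
import math


def softplus(x):
    # Stable log(1 + exp(x)). Positive for moderate inputs; for extremely
    # negative x the floating-point result can underflow to 0.0.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def extract_evidence(weights, bias, h):
    """e = softplus(W h + b): one evidence value per class."""
    return [softplus(sum(w * hi for w, hi in zip(row, h)) + b)
            for row, b in zip(weights, bias)]


def update_belief(alpha, evidence):
    """Conjugate update: alpha_{t+1} = alpha_t + e_t."""
    return [a + e for a, e in zip(alpha, evidence)]


def expected_probs(alpha):
    """Dirichlet mean E[p_k] = alpha_k / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]


alpha0 = [1.0, 1.0, 1.0]                       # assumed uniform prior, K = 3
W_mid = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # toy expert weights
b_mid = [0.0, 0.0, 0.0]
h_x = [2.0, -1.0]                               # toy oracle features

e_mid = extract_evidence(W_mid, b_mid, h_x)
alpha1 = update_belief(alpha0, e_mid)
print(sum(alpha0), sum(alpha1))  # total precision before and after the update
print(expected_probs(alpha1))
```

Because every evidence entry is positive, the total precision after the update exceeds the prior precision, so confidence can already be evaluated at $t=1$.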
##### 2\. The Failure of Soft Routing in Sequential Frameworks
The sequential nature of NBSR fundamentally breaks the standard “soft routing” paradigm used by classical neural trees.
In a classical hierarchical MoE, an image traverses every branch of the tree simultaneously. The network calculates a Path Probability for every terminal leaf by taking the product of the routing decisions along that branch. For example:
$$\pi_{\text{pets}} = P(\text{Animal}\mid\text{Root}) \times P(\text{Pets}\mid\text{Animal}) \tag{41}$$
The final prediction is then calculated by multiplying every terminal leaf’s output by its respective path probability and summing them together to form a weighted continuous average:
$$\mathbf{y}_{\text{final}} = \sum_{\text{all leaves}} \pi_{\text{leaf}} \cdot \mathbf{y}_{\text{leaf}} \tag{42}$$
While this probability-weighted average of ensembles works for classical trees where only leaves produce outputs, it fails catastrophically in the NBSR framework. Because NBSR extracts evidence at multiple depths, applying standard soft routing would result in a severe entanglement of intermediate and leaf evidence:
$$\bm{\alpha}_T = \bm{\alpha}_0 + \sum_{\text{mid\_nodes}} \pi_{\text{mid}}\mathbf{e}_{\text{mid}} + \sum_{\text{leaf\_nodes}} \pi_{\text{leaf}}\mathbf{e}_{\text{leaf}} \tag{43}$$
If the router were allowed to output continuous probabilities (e.g. a 30%/30%/40% split), the final belief state would become a blended average of intermediate evidence layered on top of a blended average of leaf evidence. This compounding continuous relaxation inevitably leads to representation collapse. The local experts would fail to specialize, learning generic, overlapping features because the network relies on the blended ensemble to minimize the loss.
By enforcing hard discrete routing via the Gumbel-Softmax Straight-Through Estimator (STE), NBSR sidesteps this failure mode. The STE forces $\pi\in\{0,1\}$ (i.e. “hard routing”), ensuring that exactly one clean, sequential path of evidence (e.g. $\mathbf{e}_{\text{animal\_mid}} + \mathbf{e}_{\text{pets}}$) is extracted and added to the Dirichlet prior, thereby guaranteeing absolute expert specialization while retaining end-to-end differentiability.
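The contrast between the blended accumulation of Eq. 43 and a single hard path can be made concrete with a small sketch. The evidence vectors and routing weights below are hypothetical values for a three-class toy problem with one visited mid node and two leaves.

```python
def accumulate(alpha0, weighted_evidence):
    """alpha_T = alpha_0 + sum_v pi_v * e_v over (pi_v, e_v) pairs."""
    alpha = list(alpha0)
    for weight, evidence in weighted_evidence:
        alpha = [a + weight * e for a, e in zip(alpha, evidence)]
    return alpha


alpha0 = [1.0, 1.0, 1.0]
e_mid = [2.0, 0.5, 0.5]    # hypothetical evidence from the visited mid node
e_pets = [6.0, 0.2, 0.1]   # hypothetical leaf evidence
e_wild = [0.1, 0.3, 5.0]

# Soft routing (Eq. 43): both leaves contribute, scaled by their path weight.
soft = accumulate(alpha0, [(1.0, e_mid), (0.6, e_pets), (0.4, e_wild)])
# Hard routing: pi is one-hot, so only the selected leaf contributes.
hard = accumulate(alpha0, [(1.0, e_mid), (1.0, e_pets), (0.0, e_wild)])

print(soft)
print(hard)
```

Under soft routing the belief state mixes evidence for the first and third classes, whereas the hard path contains only the evidence of the nodes actually visited.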
##### 3\. The Input Space: Raw Partitioning vs\. Semantic Oracle Querying
In classical hierarchical models like CART [[7](https://arxiv.org/html/2605.26147#bib.bib77)] and standard HMEs [[42](https://arxiv.org/html/2605.26147#bib.bib78), [74](https://arxiv.org/html/2605.26147#bib.bib79)], both the routing decisions and the expert predictions are typically computed directly from the raw input features $x$ (e.g., via scalar thresholds or simple linear projections). The tree explicitly partitions this raw input space.
In contrast, NBSR introduces a Persistent Global Knowledge Oracle ($\mathbf{h}_x$). A shared deep neural backbone first maps the raw input into a dense, high-dimensional semantic space. This oracle is then broadcast to the entire graph, so it is available to all nodes at all depths. Consequently, the local experts in NBSR do not partition raw pixels or isolated tabular columns; instead, they act as active attention mechanisms, applying specialized learned filters ($\mathbf{W}_v$) to dynamically extract high-level conceptual evidence directly from the shared global representation at various stages of the reasoning process.
##### 4\. Aggregation Mechanism and Uncertainty Quantification: Multiplicative Probabilities vs\. Additive Evidence
In classical HMEs [[42](https://arxiv.org/html/2605.26147#bib.bib78), [74](https://arxiv.org/html/2605.26147#bib.bib79)], the final prediction is formed by multiplying gating probabilities along the path from root to leaf to form a path weight, which then weights the expert’s probability distribution. Because this framework operates entirely within a zero-sum probability space ($\sum p = 1$), it forces the network to distribute a total probability mass of 1 across all outcomes, stripping the model of any native ability to quantify structural ignorance.
In contrast, NBSR operates in an evidential space. Instead of multiplying probabilities, NBSR adds continuous evidence vectors to a Dirichlet belief state ($\bm{\alpha}_{t+1} = \bm{\alpha}_t + \mathbf{e}_t$). This additive accumulation breaks the zero-sum constraint during the intermediate steps, enabling the model to physically inflate the total volume of evidence (the Dirichlet precision, $\alpha_0$). Consequently, NBSR natively provides rigorous Bayesian uncertainty quantification ($u = K/\alpha_0$). This allows the NBSR framework to explicitly express “I don’t know” when encountering out-of-distribution data, a critical safety capability fundamentally absent in classical HMEs and traditional decision trees.
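The uncertainty measure $u = K/\alpha_0$ is a one-line computation. The sketch below assumes a uniform prior of ones, so that $u = 1$ before any evidence has been added; the evidence values are illustrative.

```python
def uncertainty_mass(alpha):
    """u = K / alpha_0, where alpha_0 = sum_k alpha_k is the Dirichlet precision."""
    return len(alpha) / sum(alpha)


prior = [1.0] * 10                  # assumed uniform prior over K = 10 classes
confident = [91.0] + [1.0] * 9      # 90 units of evidence for a single class

print(uncertainty_mass(prior))      # no evidence yet
print(uncertainty_mass(confident))  # shrinks as evidence accumulates
```

Since evidence is only ever added, $u$ can only decrease along a routing trajectory, and an input that elicits little evidence keeps $u$ close to its prior value.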
## Appendix F Further Results for CIFAR-10 Classification
### F.1 Computing Environment and Experimental Setup
##### Computing Environment\.
All experiments, training regimes, and inference benchmarking were conducted on a high-performance Linux server (x86_64 architecture) equipped with a 24-core (48 logical threads) AMD EPYC 9B45 processor and 190 GB of system memory. GPU acceleration was provided by a single NVIDIA RTX PRO 6000 Blackwell Server Edition with 96 GB of VRAM, running CUDA version 13.0 and NVIDIA Driver 580.82.07. To ensure strict reproducibility across all baselines, identical deterministic seeds (Seed = 111) were applied to Python’s `random`, NumPy, and PyTorch libraries, alongside the enforcement of deterministic CuDNN backend algorithms.
##### Network Architectures\.
The visual feature extraction backbone for all models is an unfrozen ResNet-18 [[28](https://arxiv.org/html/2605.26147#bib.bib71)], initialized with ImageNet weights. The network is truncated prior to its final classification head, yielding a 512-dimensional global feature vector $\mathbf{h}_x$ for each input image.
In our Neural Bayesian Sequential Routing (NBSR) framework, the routers and experts are instantiated as follows:
- Routers: each internal node acts as a gating mechanism parameterized by a Multi-Layer Perceptron (MLP). The router concatenates the visual features $\mathbf{h}_x$ with the current Dirichlet concentration state $\bm{\alpha}_t$, passing them through a 128-unit hidden layer with a ReLU activation, followed by a linear projection to the number of branching paths. The output is sampled via a continuous Gumbel-Softmax approximation during training.
- Local Experts: each expert module (e.g. `animal_mid`, `pets`, `wildlife`) consists of a single linear transformation ($\mathbb{R}^{512} \to \mathbb{R}^{10}$) followed by a Softplus activation function. The Softplus nonlinearity is strictly required to ensure that the extracted evidence vector remains positive ($e_k > 0$), preserving the mathematical integrity of the Dirichlet parameter updates.
##### Hyperparameters and Optimization\.
All models were trained for a total of 150 epochs with a batch size of 128. We optimized the networks using the Adam optimizer [[47](https://arxiv.org/html/2605.26147#bib.bib24)] with an initial learning rate of $2\times 10^{-4}$, decayed by a factor of $\gamma = 0.5$ every 45 epochs via a StepLR scheduler. Gradients were clipped at a maximum norm of $1.0$ to ensure stability.
For the data pipeline, the CIFAR-10 images were upscaled to $224\times 224$ to match the expected resolution of the ImageNet-pretrained ResNet backbone. Standard spatial augmentations were applied during training, including random horizontal flipping, random rotation ($\pm 15^{\circ}$), and color jittering (brightness $= 0.2$), followed by standard channel-wise normalization.
For the NBSR-specific hyperparameters, the structural entropy penalty weight was set to $\lambda = 10^{-3}$. To balance early-stage structural exploration with late-stage discrete routing, the Gumbel-Softmax temperature $\tau$ was annealed exponentially per epoch from an initial value of $1.0$ down to a minimum threshold of $0.1$ at a decay rate of $0.97$.
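The two schedules described above can be written out explicitly. This is a sketch under the assumption that epochs are indexed from 0 and that the temperature follows $\tau = \max(0.1,\, 1.0 \cdot 0.97^{\text{epoch}})$; the function names are illustrative.

```python
import math


def gumbel_temperature(epoch, tau_init=1.0, tau_min=0.1, decay=0.97):
    """Exponentially annealed temperature, clipped at tau_min."""
    return max(tau_min, tau_init * decay ** epoch)


def step_lr(epoch, lr_init=2e-4, gamma=0.5, step_size=45):
    """Step decay: multiply the learning rate by gamma every step_size epochs."""
    return lr_init * gamma ** (epoch // step_size)


# First epoch at which the unclipped temperature drops to tau_min or below.
first_floor_epoch = math.ceil(math.log(0.1) / math.log(0.97))

for epoch in (0, 45, 75, first_floor_epoch, 149):
    print(epoch, round(gumbel_temperature(epoch), 4), step_lr(epoch))
```

Under these assumptions the temperature reaches its floor roughly halfway through the 150-epoch run, so the second half of training uses the sharpest routing distribution.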
### F.2 Evaluation Metric: Expected Calibration Error (ECE)
The Expected Calibration Error (ECE) is a standard metric used to quantify how well a model’s predicted confidence aligns with its actual empirical accuracy. Geometrically, it measures the aggregate absolute deviation from perfect calibration. To calculate the ECE, the continuous probability space $[0,1]$ is partitioned into $M$ equally spaced bins. Each test sample is assigned to a specific bin based on its maximum predicted confidence score. The final error is then calculated as the weighted average of the absolute difference between the true accuracy and the mean predicted confidence within each bin:
$$\text{ECE} = \sum_{m=1}^{M}\frac{|B_m|}{N}\left|\text{acc}(B_m) - \text{conf}(B_m)\right|$$
where $N$ is the total number of evaluated samples, $B_m$ represents the set of samples whose predicted confidence falls into bin $m$, $|B_m|$ is the total number of samples in that bin, $\text{acc}(B_m)$ is the empirical accuracy of those specific samples, and $\text{conf}(B_m)$ is their average predicted confidence.
In our evaluations, we partition the probability space into $M = 10$ bins. For the standard deterministic baselines (Flat ResNet-18 and Sparse MoE), the maximum softmax probabilities are used as the confidence scores. For our NBSR framework, the structurally derived Bayesian expected marginals $\mathbb{E}[p_k]$ (cf. Eq. [17](https://arxiv.org/html/2605.26147#S3.Ex3)) are passed directly into this algorithm. As summarized in the main text (Table [2](https://arxiv.org/html/2605.26147#S6.T2)), this evaluates to an ECE of 0.015, indicating that the evidence-based uncertainty of NBSR aligns more closely with empirical accuracy than that of the overconfident deterministic baselines.
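A minimal reference computation of the ECE formula is sketched below. It assumes left-closed bins $[m/M, (m+1)/M)$ with a confidence of exactly 1 placed in the last bin; other implementations use right-closed bins, which can shift samples that lie exactly on a bin edge.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE = sum_m |B_m| / N * |acc(B_m) - conf(B_m)|.

    confidences: maximum predicted probability per sample, in [0, 1].
    correct: booleans, True where the predicted class matches the label.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # Map the confidence to a bin index; conf == 1.0 goes to the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1.0 for _, h in members if h) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Two samples at 0.95 confidence (both correct), two at 0.55 (one correct).
print(expected_calibration_error([0.95, 0.95, 0.55, 0.55],
                                 [True, True, True, False]))
```

For NBSR, the `confidences` argument would hold the largest expected marginal $\max_k \mathbb{E}[p_k]$ per sample, which is an assumption consistent with how the deterministic baselines are scored.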
### F.3 Baseline Training Dynamics
To complement the quantitative results presented in the main text (Section [6.2](https://arxiv.org/html/2605.26147#S6.SS2)), we provide the complete end-to-end training loss and test accuracy trajectories for the Flat ResNet-18 (Fig. [15](https://arxiv.org/html/2605.26147#A6.F15)) and the Sparse MoE (Fig. [16](https://arxiv.org/html/2605.26147#A6.F16)) baselines over the 150-epoch training regime.
Notably, the Sparse MoE exhibits distinct variance and oscillation in its test accuracy, particularly in the later stages of training. This instability is a classic symptom of the continuous soft-routing gates competing against the classification experts, further highlighting the optimization friction introduced by standard MoE architectures on semantic classification tasks.
Figure 15: Training loss and test accuracy dynamics for the standard Flat ResNet-18 baseline on the CIFAR-10 dataset.
Figure 16: Training loss and test accuracy dynamics for the Sparse MoE (Soft Routing) baseline on the CIFAR-10 dataset. The noticeable jitter in the accuracy curve illustrates the optimization friction inherent in continuous soft-routing mechanisms.
## Appendix G Experimental Details: Structured Medical Diagnosis
Here we provide the technical details, network architectures, and hyperparameters required to reproduce the structured tabular clinical experiments presented in Section [6.3](https://arxiv.org/html/2605.26147#S6.SS3).
### G.1 Computing Environment
All experiments, including baseline training and NBSR evaluation, were executed on a cloud-based virtual machine instance (Google Colab) to measure CPU-bound inference efficiency. The hardware specifications are as follows:
- Processor: Intel(R) Xeon(R) CPU @ 2.20GHz (1 Physical Core, 2 Logical Threads)
- System Memory (RAM): 13.61 GB
- Hardware Accelerator: none (CPU-only execution)
- Software Stack: Python 3.12, PyTorch 2.10.0, and XGBoost 2.0.
### G.2 Dataset and Preprocessing
We utilized the Kaggle Disease Symptom Prediction dataset ([https://www.kaggle.com/datasets/kaushil268/disease-prediction-using-machine-learning](https://www.kaggle.com/datasets/kaushil268/disease-prediction-using-machine-learning)), which maps patient profiles to specific clinical endpoints.
- Input Features ($X$): 132 binary indicators representing the presence (1) or absence (0) of specific clinical symptoms (e.g. `skin_rash`, `joint_pain`).
- Target Classes ($Y$): 41 discrete disease categories.
- Clinical Noise Injection: real-world Electronic Health Records (EHR) are inherently noisy due to entry errors, missing tests, and ambiguous patient reporting. To simulate this environment and prevent the models from overfitting to a perfectly deterministic toy dataset, we applied a uniform 5% noise mask to both the training and testing sets. Specifically, for 5% of the feature matrix $X$, the binary symptom states were inverted ($X_{i,j} \leftarrow 1 - X_{i,j}$).
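The noise injection step can be sketched as follows. The sketch assumes that each entry is flipped independently with probability 0.05, which yields approximately (not exactly) 5% inverted entries; the original pipeline may instead flip an exact 5% subset. The function name `inject_noise` is hypothetical.

```python
import random


def inject_noise(features, flip_rate=0.05, seed=111):
    """Return a copy of a binary feature matrix with entries flipped at random.

    Each entry X[i][j] becomes 1 - X[i][j] with probability flip_rate.
    """
    rng = random.Random(seed)
    return [[1 - x if rng.random() < flip_rate else x for x in row]
            for row in features]


clean = [[0] * 132 for _ in range(200)]  # toy matrix: 200 patients, 132 symptoms
noisy = inject_noise(clean)
flipped = sum(sum(row) for row in noisy)
print(flipped / (200 * 132))  # empirical flip fraction
```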
### G.3 Network Architectures
##### Flat MLP Baseline.
The standard monolithic deep learning baseline was constructed using a sequential stack of dense layers:
1. Linear($132 \to 128$), ReLU activation
2. Linear($128 \to 128$), ReLU activation
3. Linear($128 \to 41$), Softmax activation
##### NBSR Architecture.
The NBSR framework utilized a maximum depth of 2 (Root $\to$ Mid $\to$ Leaf) with a uniform branching factor of 4. The modular components were parameterized as follows:
- Global Oracle Backbone: identical to the MLP hidden layers to ensure fair semantic capacity: Linear($132 \to 128$) $\to$ ReLU $\to$ Linear($128 \to 128$) $\to$ ReLU.
- Router Networks: MLP processing the concatenated feature state $\mathbf{h}_x$ and current belief $\bm{\alpha}_t$: Linear($169 \to 64$) $\to$ ReLU $\to$ Linear($64 \to 4$), followed by a Gumbel-Softmax activation.
- Expert Networks (Evidence Extractors): Linear($128 \to 41$) followed by a Softplus activation to ensure strictly positive evidence vectors $\mathbf{e}_t > 0$.
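The layer sizes above can be cross-checked with simple arithmetic: the router input width of 169 is the 128-dimensional oracle output concatenated with the 41-dimensional belief vector. The parameter counts in this sketch assume that every linear layer carries a bias term, which the text does not state.

```python
def linear_params(n_in, n_out, bias=True):
    """Number of trainable parameters in one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


hidden_dim, num_classes, branching = 128, 41, 4

router_in = hidden_dim + num_classes  # concat of h_x and alpha_t
router_params = linear_params(router_in, 64) + linear_params(64, branching)
expert_params = linear_params(hidden_dim, num_classes)

print(router_in, router_params, expert_params)
```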
### G.4 Optimization and Hyperparameters
Both the Flat MLP and NBSR networks were trained using the Adam optimizer [[47](https://arxiv.org/html/2605.26147#bib.bib24)] with a learning rate of $1\times 10^{-3}$ and a batch size of 128 for 40 epochs. A global random seed of 111 was enforced across data splits, noise injection, and network initializations.
For the XGBoost baseline, we utilized the default `XGBClassifier` parameters with `eval_metric='mlogloss'` and `use_label_encoder=False`.
Specific NBSR hyperparameters were set as follows:
- Routing Temperature ($\tau$): the Gumbel-Softmax temperature was exponentially annealed per epoch to smoothly transition from soft exploration to hard, discrete routing: $\tau = \max(0.1,\, 1.0 \cdot 0.9^{\text{epoch}})$.
- Entropy Penalty ($\lambda$): to prevent artificial overconfidence on the heavily perturbed noisy clinical data, the explicit entropy minimization penalty was disabled ($\lambda = 0.0$). The network relied purely on the Negative Log-Likelihood (NLL) of the expected Dirichlet probability.
- Early Exiting Threshold ($\eta$): during inference, the “Deep” configuration bypassed early exiting (threshold evaluated as `None`), forcing the full architectural depth. The “Fast” configuration utilized a differential entropy threshold of $\eta = -100.0$ to dynamically truncate routing on unambiguous clinical presentations.
## Appendix H Experimental Details: Language Modeling
Here we provide the technical details, network architectures, and hyperparameters required to reproduce the synthetic language modeling and contextual disambiguation experiments presented in Section [6.4](https://arxiv.org/html/2605.26147#S6.SS4).
### H.1 Computing Environment
All sequence modeling experiments were deliberately executed on a cloud-based virtual machine instance (Google Colab). The hardware specifications are as follows:
- Processor: Intel(R) Xeon(R) CPU @ 2.20GHz (1 Physical Core, 2 Logical Threads)
- System Memory (RAM): 13.61 GB
- Hardware Accelerator: none (CPU-only execution)
- Software Stack: Python 3.12, PyTorch 2.1.0
### H.2 Dataset and Preprocessing
To perfectly isolate syntactic reasoning from semantic world-knowledge, we constructed a synthetic contextual disambiguation corpus.
- Vocabulary ($|\mathcal{V}|$): 65 discrete tokens, strictly partitioned into 4 special tokens (`<pad>`, `<bos>`, `<eos>`, `<unk>`), 19 function words, 14 modifiers, 14 abstract nouns, and 14 concrete entities.
- Generative Templates: sequences were procedurally generated using uniform random sampling across 5 predefined syntactic templates (e.g. `['function', 'concrete', 'function', 'function', 'abstract']`).
- Data Splits: the dataset was split into 10,000 training sequences and 2,000 held-out test sequences. All sequences were padded to a maximum length of 6 tokens.
### H.3 Network Architectures
##### Shared Causal Transformer Backbone.
All three evaluated models (Standard, MoE, NBSR) share an identical Transformer backbone to ensure fair semantic capacity. It consists of an embedding layer ($d_{model} = 64$) with positional encoding, followed by a 2-layer causal Transformer Encoder utilizing 4 attention heads and a feed-forward dimension of 256.
##### Standard Transformer Baseline.
The monolithic baseline processes the backbone’s contextual representation $\mathbf{h}_x$ through a standard dense linear projection: Linear($64 \to 65$), followed by a standard Log-Softmax activation.
##### Transformer MoE Baseline.
The MoE baseline utilizes a discrete Top-1 routing head without evidence accumulation. It consists of a router network, Linear($64 \to 4$) with Softmax, and 4 independent expert networks parameterized as Linear($64 \to 65$).
##### NBSR Architecture.
The NBSR framework utilizes a hierarchical Directed Acyclic Graph (DAG) with a maximum depth of 2. The modular components were parameterized as follows:
- Root Router: an MLP processing the concatenated state $\mathbf{h}_x$ and the uniform Dirichlet prior $\bm{\alpha}_0$: Linear($129 \to 32$) $\to$ ReLU $\to$ Linear($32 \to 2$), followed by a Gumbel-Softmax activation.
- Mid-Level Experts (Syntax vs. Semantics): 2 distinct experts, each structured as Linear($64 \to 64$) $\to$ ReLU $\to$ Linear($64 \to 65$), followed by a Softplus activation to ensure strictly positive evidence vectors $\mathbf{e}_t > 0$.
- Leaf Experts (PoS Categories): 4 terminal experts mapping to the distinct linguistic sub-spaces (Function, Modifier, Abstract, Concrete). These are identical in structure to the mid-level experts.
### H.4 Optimization and Hyperparameters
All networks were trained using the Adam optimizer [[47](https://arxiv.org/html/2605.26147#bib.bib24)] with a learning rate of $1\times 10^{-3}$ and a batch size of 64 for 5 epochs. A global random seed of 42 was enforced across data generation and network initializations.
Specific NBSR hyperparameters were set as follows:
- Routing Temperature ($\tau$): the Gumbel-Softmax temperature was exponentially annealed per epoch to smoothly transition from soft exploration to hard, discrete routing: $\tau = \max(0.1,\, 1.0 \cdot 0.9^{\text{epoch}})$.
- Early Exiting Threshold ($\eta$): during inference, the “Deep” configuration bypassed early exiting (threshold evaluated as `None`), forcing the full architectural depth. The “Fast” configuration utilized a differential entropy threshold of $\eta = 1.0$ to dynamically halt routing on predictable function words.
- OOD Abstention Threshold ($\tau_{conf}$): the critical confidence threshold for evaluating the epistemic uncertainty of the total Dirichlet precision on out-of-distribution prompts was defined mathematically as $\tau_{conf} = 1.5\times|\mathcal{V}| = 97.5$.
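The abstention threshold can be read as a test on the total Dirichlet precision. The sketch below assumes the rule "abstain when the precision is below $\tau_{conf}$", which is the natural reading but is not spelled out here; with $u = K/\alpha_0$ and $K = |\mathcal{V}|$ this is the same as abstaining when $u > 1/1.5$. The belief vectors are illustrative.

```python
def should_abstain(alpha, factor=1.5):
    """Abstain when total precision falls below factor * K (assumed rule)."""
    k = len(alpha)
    return sum(alpha) < factor * k


vocab_size = 65
threshold = 1.5 * vocab_size

prior_only = [1.0] * vocab_size              # no evidence gathered: precision 65
informed = [41.0] + [1.0] * (vocab_size - 1)  # 40 units of evidence: precision 105

print(threshold)
print(should_abstain(prior_only), should_abstain(informed))
```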
## Appendix I Experimental Setup for the POMDP Navigation Task
Here we provide detailed specifications for the experimental environment, dataset generation, model architectures, and training hyperparameters utilized in the sequential control and planning task (Section [6.5](https://arxiv.org/html/2605.26147#S6.SS5)).
### I.1 Computing Environment
All experiments, training, and evaluations were conducted on a Google Colab virtual instance. To ensure reproducibility and demonstrate the lightweight nature of the NBSR-Mem architecture, the experiments were executed entirely on the CPU without hardware acceleration. The environment specifications are as follows:
- CPU: AMD EPYC 7B12 (1 physical core, 2 logical threads), 2.25 GHz base clock.
- Memory: 13.61 GB RAM.
- Software Stack: Python 3.12.13, PyTorch 2.10.0+cpu.
### I.2 POMDP Dataset Generation
The navigation task was framed as a classification problem via Behavioral Cloning. We procedurally generated a synthetic dataset of optimal state-action sequences representing a Partially Observable Markov Decision Process (POMDP).
##### State Representation:
each state observation $o_t$ is a $4\times 5\times 5$ spatial tensor representing a local grid centered on the agent. The 4 channels encode distinct environmental features (e.g. floor, agent position, left-turn cues, right-turn cues).
##### Trajectory Structure:
each sequence has a fixed length of $T=4$ timesteps, structured to enforce reliance on temporal memory:
1. $t=0$ (Memory Cue): the agent receives a cue dictating the ultimate turning direction. A random variable determines if the sequence is “simple” (agent moves straight indefinitely) or “complex” (agent must evade a wall at the end). For complex sequences, a target direction (Left or Right) is selected, and a corresponding visual cue is instantiated in the spatial tensor.
2. $t=1,2$ (Spatial Amnesia Zone): the agent traverses a featureless corridor. The optimal action is 0 (Cruising/Straight). The visual cue from $t=0$ is no longer present in the observation.
3. $t=3$ (Intersection): if the sequence is complex, a wall appears directly ahead of the agent. The optimal action is the target direction designated at $t=0$. If the sequence is simple, the corridor remains empty, and the optimal action remains 0.
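The label structure of these trajectories can be sketched as follows. This is a minimal illustration that generates only the optimal action sequence, not the $4\times 5\times 5$ observations, since the exact grid layout and cue encoding are not specified here. The function name `sample_action_sequence`, the choice of action 0 at $t=0$, and the equal chance of Left versus Right are assumptions. The 70% complex rate and the action indices follow the values given elsewhere in this appendix.

```python
import random

STRAIGHT, REVERSE, LEFT, RIGHT = 0, 1, 2, 3  # action indices used in this appendix
T = 4                                        # fixed sequence length


def sample_action_sequence(rng, p_complex=0.7):
    """Sample the optimal action labels for one length-4 trajectory.

    Returns (is_complex, target, actions). Observations are omitted because
    the grid layout and cue encoding are not specified in the text.
    """
    is_complex = rng.random() < p_complex
    target = rng.choice([LEFT, RIGHT]) if is_complex else None
    actions = [STRAIGHT] * T       # t = 0..2: cruise straight (t = 0 is assumed)
    if is_complex:
        actions[T - 1] = target    # t = 3: turn in the direction cued at t = 0
    return is_complex, target, actions


rng = random.Random(0)
dataset = [sample_action_sequence(rng) for _ in range(1000)]
complex_fraction = sum(c for c, _, _ in dataset) / len(dataset)
```

Because the cue is visible only at $t=0$ and the turn is required only at $t=3$, a model that sees just the current observation cannot tell Left from Right at the intersection.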
##### Out\-of\-Distribution \(OOD\) Generation:
To test epistemic safety, an “Alien Hazard” was generated by injecting a high-magnitude activation into a specific coordinate of the fourth visual channel, a feature entirely absent from the training distribution.
##### Dataset Splits:
The generated dataset comprised 5,000 training sequences and 1,000 test sequences\. The probability of a sequence being “complex” \(requiring a turn\) was set to 70% to heavily penalize purely reactive models\.
### I.3 Model Architectures
All models share a common Global Knowledge Oracle (CNN Backbone) to ensure a fair comparison of routing and memory capabilities.
##### CNN Backbone:
The feature extractor consists of two convolutional layers followed by a linear projection:
- Conv2d: 4 input channels, 32 output channels, kernel size $3\times 3$, padding 1.
- Conv2d: 32 input channels, 64 output channels, kernel size $3\times 3$, padding 0.
- Linear: 576 ($64\times 3\times 3$) to 128 dimensions.
- LayerNorm: applied to the 128-dimensional output without elementwise affine parameters to strictly anchor the latent space.
- Activations: ReLU is applied after all convolutional and linear layers. Note: biases were removed from all layers to improve evidential calibration.
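The 576-dimensional input of the linear projection follows from the spatial sizes of the two convolutions. The short check below uses the standard convolution output-size formula. A stride of 1 and a dilation of 1 are assumed, since neither is stated in the list above, and the helper name `conv2d_out_size` is illustrative.

```python
def conv2d_out_size(size, kernel, padding, stride=1):
    """Spatial output size of a square convolution (standard formula, dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


grid = 5                                                         # 5x5 local observation
after_conv1 = conv2d_out_size(grid, kernel=3, padding=1)         # padding 1 keeps 5x5
after_conv2 = conv2d_out_size(after_conv1, kernel=3, padding=0)  # padding 0 shrinks to 3x3
flat_features = 64 * after_conv2 * after_conv2                   # 64 channels * 3 * 3
```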
##### Action Space and Network Outputs:
the theoretical action space $\mathcal{A}$ is defined by four discrete movements: 0 (Cruising/Straight), 1 (Reverse), 2 (Evasion Left), and 3 (Evasion Right). To enable gradient-based optimization, the network architectures output continuous 4-dimensional vectors representing their predictive confidence. For the standard baselines (CNN and CNN-GRU), the final classifier outputs a vector of log-probabilities via a Log-Softmax activation. Conversely, the NBSR architectures output a vector representing the concentration parameters of a Dirichlet distribution, $\bm{\alpha}=\mathbf{e}+\mathbf{1}$, where $\mathbf{e}\geq 0$ denotes the accumulated evidence for each action. This distinction is critical for safety: while a highly uncertain baseline is mathematically forced to distribute probabilities that sum to 1 (e.g. $[0.25,0.25,0.25,0.25]$), the NBSR model can output $\bm{\alpha}=[1.0,1.0,1.0,1.0]$, explicitly quantifying a state of total epistemic uncertainty (zero evidence) to trigger a safe halt. During training, the discrete expert action $a^{*}_{t}$ is transformed into a one-hot vector to properly isolate the penalty applied by the masked evidential regularizer.
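To make the distinction concrete, the sketch below maps an evidence vector to $\bm{\alpha}=\mathbf{e}+\mathbf{1}$ and reports the expected probabilities together with the total precision $S=\sum_i\alpha_i$. The helper name `dirichlet_summary` and the second evidence vector are hypothetical. The point is only that two beliefs can share the same uniform mean while differing in how much evidence backs them.

```python
def dirichlet_summary(evidence):
    """Map non-negative evidence e to alpha = e + 1, the expected probabilities, and S."""
    alpha = [e + 1.0 for e in evidence]
    precision = sum(alpha)                 # S = sum_i alpha_i
    mean = [a / precision for a in alpha]  # E[p_i] = alpha_i / S
    return alpha, mean, precision


# Zero evidence: the mean is uniform and S = 4 (the prior alone).
_, mean_unknown, s_unknown = dirichlet_summary([0.0, 0.0, 0.0, 0.0])

# Hypothetical balanced but plentiful evidence: the same uniform mean, but S = 44.
_, mean_conflict, s_conflict = dirichlet_summary([10.0, 10.0, 10.0, 10.0])
```

A softmax output of $[0.25,0.25,0.25,0.25]$ cannot separate these two cases, whereas the precision does.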
##### Memory Module \(CNN\-GRU & NBSR\-Mem\):
for models equipped with temporal memory, the 128-dimensional output of the CNN backbone is processed sequentially by a standard Gated Recurrent Unit (`nn.GRU`) with an input size of 128, a hidden size of 128, and `batch_first=True`.
##### Hierarchical Evidential DAG \(NBSR & NBSR\-Mem\):
the routing topology consists of three levels:
- Root Router: a single `Linear(128, 2)` layer mapping the hidden state to the two abstract modes (Cruising vs. Evasion).
- Mid-Level Experts: two `Linear(128, 4)` layers (without biases), one for each abstract mode. These extract initial Dirichlet evidence ($\mathbf{e}_{mid}$).
- Leaf Experts: four `Linear(128, 4)` layers (without biases). These extract additive Dirichlet evidence ($\mathbf{e}_{leaf}$) if the Depth 1 entropy remains above the threshold.

During training, the Root Router utilizes a standard Softmax activation. During inference, it uses an argmax (one-hot) to enable discrete early exits. All experts utilize a Softplus activation to ensure non-negative evidence generation.
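The inference-time control flow described above can be sketched in plain Python. This is an illustrative outline under stated assumptions rather than the paper's implementation. The evidence vectors are passed in as already Softplus-activated lists, one per mode. The way a leaf is selected within a mode is not modeled. The names `route`, `digamma`, and `dirichlet_entropy` are hypothetical helpers. The entropy is the standard differential entropy of a Dirichlet distribution, with the digamma function approximated by a recurrence and an asymptotic series.

```python
import math


def digamma(x):
    """Digamma for x > 0 via upward recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def dirichlet_entropy(alpha):
    """Differential entropy of Dirichlet(alpha)."""
    s, k = sum(alpha), len(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(s)
    return log_beta + (s - k) * digamma(s) - sum((a - 1.0) * digamma(a) for a in alpha)


def route(root_logits, mid_evidence, leaf_evidence, eta):
    """Hard-route at the root, add mid-level evidence, descend only if still uncertain.

    mid_evidence[m] and leaf_evidence[m] are hypothetical evidence vectors for mode m.
    Passing eta=None disables early exiting and forces the full depth.
    """
    mode = max(range(len(root_logits)), key=lambda j: root_logits[j])  # inference-time argmax
    alpha = [1.0 + e for e in mid_evidence[mode]]                      # prior of ones + e_mid
    if eta is not None and dirichlet_entropy(alpha) <= eta:
        return mode, alpha, 1                                          # early exit at depth 1
    alpha = [a + e for a, e in zip(alpha, leaf_evidence[mode])]        # add e_leaf to the belief
    return mode, alpha, 2
```

For four actions, the zero-evidence belief $\bm{\alpha}=\mathbf{1}$ has differential entropy $-\ln 3!\approx-1.79$, which is the largest value this entropy can take, so useful thresholds for this task are negative.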
### I.4 Training Protocol and Hyperparameters
All models were trained via Behavioral Cloning \[[62](https://arxiv.org/html/2605.26147#bib.bib43)\] using the Adam optimizer \[[47](https://arxiv.org/html/2605.26147#bib.bib24)\]. To stabilize the Recurrent Evidential Routing Network, we employed a specific combination of losses and regularizers.
##### Hyperparameters:
- Batch Size: 64
- Epochs: 30
- Learning Rate: $2\times 10^{-3}$
- Weight Decay: $2\times 10^{-3}$, applied exclusively to the CNN Backbone parameters.
- Base Dirichlet Prior ($\bm{\alpha}_{0}$): $\mathbf{1}$ (a tensor of ones).
- Inference Entropy Threshold ($\eta$): $-4.5$ (used for full dataset evaluation to enforce deep traversal on intersections; $-0.5$ for the single OOD trace).
- OOD Abstention Threshold ($\tau_{conf}$): $10.0$
##### Loss Formulation \(NBSR Models\):
the training objective is a composite of three terms:

$$\mathcal{L}_{total}=\mathcal{L}_{NLL}+\mathcal{L}_{reg}+\lambda_{route}\mathcal{L}_{route} \tag{44}$$

1. Negative Log-Likelihood ($\mathcal{L}_{NLL}$): computed using the expected probability of the final Dirichlet distribution: $\mathbf{p}=\bm{\alpha}_{T}/\sum\bm{\alpha}_{T}$.
2. Auxiliary Concept Loss ($\mathcal{L}_{route}$): a standard Cross-Entropy loss applied to the Root Router logits to enforce semantic mapping. We map the optimal action $a^{*}_{t}\in\{0,1,2,3\}$ to an abstract target mode $m^{*}_{t}\in\{0,1\}$ corresponding to the hierarchical routing tree. Specifically, $m^{*}_{t}=0$ if $a^{*}_{t}\in\{0,1\}$ (Cruising/Straight/Reverse) and $m^{*}_{t}=1$ if $a^{*}_{t}\in\{2,3\}$ (Evasion/Left/Right). The loss is defined as:
   $$\mathcal{L}_{route}=-\sum_{j=0}^{1}\mathbb{I}(m^{*}_{t}=j)\log(p_{root,j}) \tag{45}$$
   where $p_{root,j}$ is the predicted probability of mode $j$ from the Root Router. The weighting factor was set to $\lambda_{route}=0.1$.
3. Masked Evidential Regularization ($\mathcal{L}_{reg}$): to penalize hallucinated evidence without causing gradient starvation, we utilized the target-masked regularizer \[[69](https://arxiv.org/html/2605.26147#bib.bib28)\]:
   $$\mathcal{L}_{reg}=\lambda_{reg}^{(e)}\frac{1}{K}\sum_{i=1}^{K}(\alpha_{T,i}-1)(1-y_{i}) \tag{46}$$
   where $\mathbf{y}$ is the one-hot encoded target action. To allow the GRU time to form initial memory pathways, we applied Evidential Annealing, linearly ramping the regularizer weight $\lambda_{reg}^{(e)}$ from $0.0$ to $0.02$ over the first 10 epochs.
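A per-sample version of Equations (44) to (46) can be written in a few lines. The sketch below is a minimal illustration, not the training code. The exact form of the linear ramp (zero at epoch 0, reaching $0.02$ at epoch 10 and constant afterwards) is an assumption, and the names `reg_weight` and `nbsr_loss` are hypothetical.

```python
import math


def reg_weight(epoch, final=0.02, ramp_epochs=10):
    """Linear ramp of the regularizer weight (assumed form: 0 at epoch 0, `final` from epoch 10)."""
    return final * min(1.0, epoch / ramp_epochs)


def nbsr_loss(alpha, action, p_root, epoch, lambda_route=0.1):
    """Composite loss for one sample: NLL + masked regularizer + weighted routing CE."""
    k = len(alpha)
    total = sum(alpha)
    p = [a / total for a in alpha]                       # expected Dirichlet probabilities
    nll = -math.log(p[action])                           # L_NLL
    y = [1.0 if i == action else 0.0 for i in range(k)]  # one-hot target action
    masked = sum((a - 1.0) * (1.0 - yi) for a, yi in zip(alpha, y))
    reg = reg_weight(epoch) * masked / k                 # L_reg, Eq. (46)
    mode = 0 if action in (0, 1) else 1                  # {0,1} -> Cruising, {2,3} -> Evasion
    route = -math.log(p_root[mode])                      # L_route, Eq. (45)
    return nll + reg + lambda_route * route              # L_total, Eq. (44)
```

The mask $(1-y_i)$ leaves the evidence on the target action unpenalized, so the regularizer only shrinks evidence assigned to the other actions.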
## Appendix J Experimental Details for BOED Active Clinical Triage
Here we provide the technical specifications and hyperparameter configurations required to reproduce the Active Clinical Triage experiments presented in Section [6.6](https://arxiv.org/html/2605.26147#S6.SS6).
### J.1 Computing Environment
All experiments were executed on a cloud-based virtual machine instance (KVM full virtualization) to specifically isolate and measure CPU-bound inference efficiency. The exact hardware specifications are as follows:
- Processor: Intel(R) Xeon(R) CPU @ 2.20GHz, x86_64 architecture (1 physical core, 2 logical threads) with 55 MiB L3 cache.
- System Memory (RAM): 13.61 GB
- Storage: 242.49 GB allocated disk space
- Hardware Accelerator: None (strict CPU-only execution; no GPU utilized)
### J.2 Dataset and Preprocessing
We utilized the Kaggle Disease Symptom Prediction dataset (same as used in Section [6.3](https://arxiv.org/html/2605.26147#S6.SS3)), mapping clinical symptom profiles to 41 distinct pathologies.
- Input Features ($X$): 132 binary indicators representing symptom presence (1) or absence (0).
- Partitioning: Features were partitioned into 12 base demographic features (always visible) and 5 Diagnostic Test Panels, each containing 24 task-specific symptom indicators.
- Clinical Noise: A uniform 5% noise mask was applied to the feature matrix, where binary states were stochastically inverted ($0\leftrightarrow 1$).
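The partitioning arithmetic ($12+5\times 24=132$) and the noise mask can be illustrated as follows. The column layout (base features first, then five contiguous panels) and the reading of the 5% mask as an independent flip per entry are assumptions made for this sketch. The names `base_idx`, `panel_idx`, and `flip_noise` are hypothetical.

```python
import random

N_FEATURES, N_BASE, N_PANELS, PANEL_SIZE = 132, 12, 5, 24

# Assumed layout: base features first, then five contiguous panels of 24 columns.
base_idx = list(range(N_BASE))
panel_idx = [
    list(range(N_BASE + p * PANEL_SIZE, N_BASE + (p + 1) * PANEL_SIZE))
    for p in range(N_PANELS)
]


def flip_noise(x, rng, rate=0.05):
    """Invert each binary feature independently with probability `rate`."""
    return [1 - v if rng.random() < rate else v for v in x]


rng = random.Random(0)
clean = [0] * N_FEATURES       # placeholder symptom vector, not real data
noisy = flip_noise(clean, rng)
```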
### J.3 Network Architectures
##### AR-NBSR Architecture.
- Global Router: An MLP processing the concatenated state $[\mathbf{h}_{x},\bm{\alpha}_{t},\mathbf{m}_{t}]$, where $\mathbf{m}_{t}$ is the one-hot history of previously queried panels. Architecture: Linear($178\to 128$) $\to$ LayerNorm $\to$ ReLU $\to$ Linear($128\to 5$).
- Expert Networks (Panels): Independent feature extractors: Linear($36\to 128$) $\to$ LayerNorm $\to$ ReLU $\to$ Linear($128\to 128$) $\to$ ReLU $\to$ Linear($128\to 41$) $\to$ Softplus.
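The layer widths above are consistent with the dataset dimensions, as the following sketch checks. It assumes that $\mathbf{h}_{x}$ has 132 entries (inferred from $178-41-5$) and that each expert sees the 12 base features plus its own 24-feature panel (inferred from $36=12+24$). Neither is stated explicitly, and `router_input` is a hypothetical helper.

```python
N_FEATURES, N_CLASSES, N_PANELS = 132, 41, 5
N_BASE, PANEL_SIZE = 12, 24


def router_input(h_x, alpha_t, queried):
    """Concatenate [h_x, alpha_t, m_t], where m_t marks panels already queried."""
    m_t = [1.0 if p in queried else 0.0 for p in range(N_PANELS)]
    return list(h_x) + list(alpha_t) + m_t


# Placeholder inputs: all-zero features, a prior-only belief, panels 0 and 3 queried.
state = router_input([0.0] * N_FEATURES, [1.0] * N_CLASSES, queried={0, 3})
expert_input_dim = N_BASE + PANEL_SIZE  # base features plus one panel
```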
### J.4 Optimization and Training Dynamics
Networks were optimized using Adam \[[47](https://arxiv.org/html/2605.26147#bib.bib24)\] ($\text{lr}=10^{-3}$) for 200 epochs with a batch size of 128. A `ReduceLROnPlateau` scheduler ($\text{factor}=0.5$, $\text{patience}=10$) was employed to stabilize convergence.
##### Hyperparameters.
- Entropy Penalty ($\lambda$): $1\times 10^{-4}$.
- Budget Penalty ($\gamma$): $0.005$.
- Gumbel-Softmax Temperature ($\tau$): Exponentially annealed from $1.0$ to $0.1$.
- Training Early Exiting ($\eta$): $-150.0$.
- Evaluation Sweep ($\eta$): For the Pareto Frontier, we evaluated a dense grid of 30 linearly spaced confidence thresholds $\eta$ ranging from $-110.0$ to $-400.0$.
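The two schedules in this list can be sketched as below. The exact annealing formula is not given, so the per-epoch exponential interpolation over 200 epochs in `gumbel_temperature` is an assumption. The local `linspace` helper is a hypothetical stand-in for a library routine, and both endpoints of the sweep are assumed to be included.

```python
def gumbel_temperature(epoch, n_epochs=200, tau_start=1.0, tau_end=0.1):
    """Exponential interpolation from tau_start (epoch 0) to tau_end (last epoch)."""
    frac = epoch / (n_epochs - 1)
    return tau_start * (tau_end / tau_start) ** frac


def linspace(start, stop, num):
    """Evenly spaced values including both endpoints."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


eta_grid = linspace(-110.0, -400.0, 30)  # evaluation sweep thresholds
```

With 30 points between $-110.0$ and $-400.0$, consecutive thresholds differ by $10.0$.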