Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters
Summary
Sigma-Branch restructures pretrained dense networks into a hierarchical binary tree with a shared backbone, routers, and specialized leaves, reducing per-inference active parameters by 58–60% while staying within 1.72 pp of baseline accuracy on CIFAR-100, ImageNet-1K, and ModelNet40.
Cached at: 06/10/26, 06:18 AM
# Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters
Source: [https://arxiv.org/html/2606.09924](https://arxiv.org/html/2606.09924)
Kohga Tanaka and Hiroaki Nishi

This work was supported by the JSPS KAKENHI (Grant Number JP26K02884). The authors also gratefully acknowledge support from the JST SIP Project (Grant Number JPJ012207). K. Tanaka is with the Graduate School of Science and Technology, Keio University, Kohoku-ku, Yokohama 223-8522, Japan (e-mail: tanaka@west.sd.keio.ac.jp). H. Nishi is with the Department of System Design, Faculty of Science and Technology, Keio University, Kohoku-ku, Yokohama 223-8522, Japan.
###### Abstract
Deploying deep neural networks on memory-constrained edge accelerators is bottlenecked by per-inference off-chip weight transfer rather than computation: the dense network cannot be retained on-chip, and every parameter must be loaded for every input. Existing model compression reduces this transfer only at the cost of permanent capacity loss. We propose Sigma-Branch ($\Sigma$B), a framework that restructures a pretrained dense network into a hierarchical binary tree composed of a shared backbone, hierarchical routers and specialized leaves. Pretrained weights are distributed across the tree via activation-based spherical $k$-means clustering, which jointly initializes router weights and per-branch channel allocations; soft-routing fine-tuning then aligns each leaf with its routed input subset. At inference, the resulting network executes only a single root-to-leaf path, reducing the active-parameter footprint while storing the complete dense parameter set in memory. Across CIFAR-100 / ResNet-50, ImageNet-1K / ResNet-50, and ModelNet40 / PointNet++, $\Sigma$B-Net reduces per-inference active parameters by 58–60% while remaining within 1.72 percentage points (pp) of the dense baseline Top-1. At comparable ImageNet-1K Top-1, the active-parameter reduction exceeds static structured pruning (FPGM, HRank) by 14–23 pp. The cross-modal evaluation, spanning 2D vision and 3D point-cloud backbones, substantiates a framework-level claim that decouples per-inference memory traffic from the total parameter count.
## I Introduction
### I-A Background
#### I-A1 Scaling of Deep Learning Models
Deep neural networks (DNNs) have advanced markedly through scaling: larger models trained with more data and compute consistently achieve higher accuracy, from convolutional backbones such as ResNet\[[1](https://arxiv.org/html/2606.09924#bib.bib1)\] to Vision Transformers (ViT) of several hundred million parameters\[[2](https://arxiv.org/html/2606.09924#bib.bib2)\]. These models are designed under the assumption of data-center-grade compute and are not natively suited to environments outside that regime.
#### I-A2 Rising Importance of Edge AI
Concurrent with this scaling trend, an increasing share of inference must be performed at the edge of the network, on devices ranging from mobile phones to FPGAs and microcontrollers\[[3](https://arxiv.org/html/2606.09924#bib.bib3)\], for reasons that include low latency, on\-device privacy, and network\-free operation\. The compute and memory capacity of such platforms are orders of magnitude smaller than those of data\-center systems, and inference typically runs at a batch size of one, which together make direct deployment of most large\-scale models impractical\. Vision models for the edge, therefore, remain predominantly CNN\-based \(e\.g\. ResNet\[[1](https://arxiv.org/html/2606.09924#bib.bib1)\], MobileNet\[[4](https://arxiv.org/html/2606.09924#bib.bib4)\], ConvNeXt\[[5](https://arxiv.org/html/2606.09924#bib.bib5)\]\) or MLP\-based for 3D data\[[6](https://arxiv.org/html/2606.09924#bib.bib6),[7](https://arxiv.org/html/2606.09924#bib.bib7)\], and even these backbones continue to scale up in pursuit of accuracy, intensifying the compute and memory demands placed on edge devices\.
### I-B Problem Statement
#### I-B1 Memory-Constrained Accelerators on the Edge
Modern DNNs exhibit a*static*computation structure: every parameter is typically accessed for every input, regardless of input difficulty\. As a result, each inference requires loading the entire set of weights from memory for models that exceed the on\-chip memory capacity\.
Meanwhile, the edge increasingly hosts accelerators beyond GPUs—FPGAs, NPUs, Edge TPUs, and microcontrollers—where on\-chip memory \(e\.g\., BRAM, SRAM\) is orders of magnitude smaller than that of data\-center systems\. Most practical models do not fit entirely on\-chip, making weight loading from off\-chip DRAM a critical bottleneck on these platforms\.
On such memory-constrained devices, weight loading directly governs inference latency: the volume of weights loaded per inference is limited by available memory bandwidth and on-chip capacity. Recent work has shown that for large DNNs at small batch sizes, weight loading time dominates while compute units remain largely idle\[[8](https://arxiv.org/html/2606.09924#bib.bib8)\]. Edge GPUs face the same issue: for example, the NVIDIA Jetson Orin Nano provides 8 GB of on-board memory at 68 GB/s of bandwidth, whereas a data-center NVIDIA H100 (SXM5) offers 80 GB at 3.35 TB/s, a 10× difference in capacity and roughly a 50× difference in bandwidth. The batch-size-one regime typical of real-time edge applications further exacerbates this bottleneck.
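To make the bandwidth argument concrete, the sketch below estimates a lower bound on per-inference weight-transfer time as bytes loaded divided by memory bandwidth. The two bandwidth figures are the ones quoted above; the ResNet-50 parameter count (about 25.6 M), FP32 storage, and the function name `weight_load_time_ms` are assumptions added for illustration. The estimate ignores caching, compute overlap, and activation traffic, so it should be read as a rough bound rather than a measured latency.

```python
def weight_load_time_ms(num_params: float, bytes_per_param: int, bandwidth_gb_s: float) -> float:
    """Bandwidth-only lower bound on the time to stream all weights once, in ms."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3

RESNET50_PARAMS = 25.6e6   # assumption: approximate ResNet-50 parameter count
FP32_BYTES = 4             # assumption: unquantized 32-bit weights

edge_ms = weight_load_time_ms(RESNET50_PARAMS, FP32_BYTES, 68.0)     # Jetson Orin Nano, 68 GB/s
dc_ms = weight_load_time_ms(RESNET50_PARAMS, FP32_BYTES, 3350.0)     # H100 (SXM5), 3.35 TB/s
print(f"edge: {edge_ms:.2f} ms, data center: {dc_ms:.3f} ms, ratio: {edge_ms / dc_ms:.1f}x")
```

Under these assumptions, loading fewer parameters per inference shrinks the bound proportionally, which is the quantity the rest of the paper targets.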
We therefore identify the reduction of*active parameter memory*, i\.e\., the parameters that must be loaded for a single inference, as a key consideration for efficient inference on memory\-constrained edge accelerators\.
#### I-B2 Model Reconstruction for Reduced Memory Footprint
Existing model compression techniques—pruning, knowledge distillation, and quantization—reduce the total parameter count permanently\[[9](https://arxiv.org/html/2606.09924#bib.bib9),[10](https://arxiv.org/html/2606.09924#bib.bib10)\]\. However, this permanent reduction can degrade the model’s representational capacity\.
A complementary direction is to exploit input\-dependent sparsity in DNN inference\. Empirical evidence indicates that the effective subset of computation varies with the input: DeepMoE shows that the useful channel subset within a convolutional layer is input\-dependent\[[11](https://arxiv.org/html/2606.09924#bib.bib11)\], and SkipNet shows that whole residual blocks can be skipped per input\[[12](https://arxiv.org/html/2606.09924#bib.bib12)\]\. Using the full parameter set for every inference is therefore inherently redundant\.
A branched architecture is a natural way to capitalize on this observation while *decoupling capacity from compute*. Each branch can specialize to a subset of the input distribution, so the model retains the full dense parameter set; at inference, only the backbone plus a single branch is active, which directly reduces the per-inference memory footprint without permanently removing parameters.
In this work, we propose $\Sigma$B (Sigma-Branch), an umbrella framework for activation-based hierarchical model reconstruction with single-path inference. Within this umbrella, we develop a single concrete instantiation, $\Sigma$B-Method, the conversion procedure described in Section [III](https://arxiv.org/html/2606.09924#S3), and refer to the network produced by it as $\Sigma$B-Net. Throughout the paper, $\Sigma$B denotes the framework concept, $\Sigma$B-Method denotes the procedure that operates on a pretrained model, and $\Sigma$B-Net denotes the resulting network whose inference behavior and empirical numbers are reported. Pretrained models are restructured into hierarchical branched networks, and only a single path is executed at inference. The total parameter count is preserved while the active-parameter memory is substantially reduced, with evaluation focused on memory-bound edge deployment.
### I-C Contributions of This Work
The contributions of this paper are as follows\.
- Hierarchical Model Reconstruction (C1). We propose $\Sigma$B-Method, a framework that converts pretrained networks into hierarchical, MoE-like branched structures, distinguishing our approach from flat MoE-style decomposition.
- Cross-Modal Applicability (C2). We demonstrate the framework on convolutional networks (ResNet-50) and on point-cloud networks (PointNet++), establishing applicability across two distinct modalities.
- Extreme Active-Parameter Reduction (C3). $\Sigma$B-Net achieves 58–60% active-parameter reduction across CIFAR-100/ResNet-50, ImageNet/ResNet-50, and ModelNet40/PointNet++, while remaining within 1.72 pp of the dense baseline's Top-1 accuracy. At comparable ImageNet-1K Top-1, this reduction exceeds that of the structured pruning baselines FPGM and HRank by 14–23 pp.
## II Related Work
We organize previous work into two streams that frame the position of $\Sigma$B-Method: static structured pruning, which permanently shrinks the model graph (Section [II-A](https://arxiv.org/html/2606.09924#S2.SS1)), and mixture-of-experts and hierarchical decomposition, which restructures a dense network to enable input-dependent computation paths (Section [II-B](https://arxiv.org/html/2606.09924#S2.SS2)). We then position $\Sigma$B-Method against representative methods along four design axes (Section [II-C](https://arxiv.org/html/2606.09924#S2.SS3)).
### II-A Static Structured Pruning
Static structured pruning permanently removes redundant filters or channels from a pretrained network in an input\-independent manner\. A range of importance criteria has been proposed to identify which parameters to drop, each offering a distinct view of filter or channel redundancy and demonstrating measurable compression benefits at comparable accuracy\. Representative importance criteria include geometric median distance among filters \(FPGM\[[13](https://arxiv.org/html/2606.09924#bib.bib13)\]\) and feature\-map rank \(HRank\[[14](https://arxiv.org/html/2606.09924#bib.bib14)\]\)\.
However, this stream shares a structural limitation that motivates the present work\. Because the parameter removal is permanent, the post\-pruning network has strictly lower representational capacity than the dense baseline; this loss has been shown to disproportionately affect minority\-class and atypical samples\[[15](https://arxiv.org/html/2606.09924#bib.bib15)\], consistent with the representational\-capacity motivation discussed in Section[I\-B](https://arxiv.org/html/2606.09924#S1.SS2)\.
### II-B Mixture-of-Experts and Hierarchical Decomposition
A complementary stream introduces input-dependent paths by decomposing a network into experts or a tree of specialized sub-networks. Conventional mixture-of-experts (MoE) increases capacity by adding many expert branches and routing each input through only the top-$k$ of them: sparsely-gated MoE\[[16](https://arxiv.org/html/2606.09924#bib.bib16)\] and DeepSeekMoE\[[17](https://arxiv.org/html/2606.09924#bib.bib17)\] both follow this expansion strategy, in which the total parameter count grows roughly with the number of experts. While effective for scaling large language models, this is less suitable for memory-bound edge accelerators, whose on-chip capacity limits the total, not just the active, parameter count.
A second line preserves the total parameter count of a pretrained dense network and instead carves it into experts or a tree. DeepMoE introduces sample-level channel routing into a convolutional backbone by gating channel subsets with a shallow embedding network\[[11](https://arxiv.org/html/2606.09924#bib.bib11)\]; routing is per-input, but the structure remains flat, with no backbone shared across the per-input sub-networks. A recent line of work analytically restructures the feed-forward (FFN) sub-layer of a pretrained transformer into a mixture of experts using neuron activation statistics alone, requiring no retraining\[[18](https://arxiv.org/html/2606.09924#bib.bib18)\]. The restructuring, however, is confined to FFN sub-layers in transformer-based large language models: the attention modules remain dense, so the model-wide active-parameter reduction is bounded by the FFN share of the total compute rather than applied to the network as a whole, and the construction does not extend to convolutional or point-cloud backbones. DecisioNet converts a CNN into a binary tree of specialized sub-networks and routes each input through a single path at inference\[[19](https://arxiv.org/html/2606.09924#bib.bib19)\], which is structurally the closest prior work to $\Sigma$B-Method; however, its tree split is supervised by a class-confusion hierarchy derived from labels, and its evaluation is confined to convolutional backbones.
### II-C Positioning of Sigma-Branch Method
Table [I](https://arxiv.org/html/2606.09924#S2.T1) positions $\Sigma$B-Method against representative methods from both streams along four design axes that follow directly from the requirements established in Section [I-B](https://arxiv.org/html/2606.09924#S1.SS2).
TABLE I: Positioning of Sigma-Branch Method relative to representative compression and dynamic-inference methods. ✓: satisfied; –: not satisfied.

A *hierarchical* structure enables progressive capacity decomposition: a shared trunk carries generic features for every sample, while deep leaves specialize to input clusters, yielding partial sharing that a flat MoE cannot achieve. *Unsupervised* partitioning by activation statistics is required wherever a class-label hierarchy is unavailable or misaligned with the learnable feature structure, including 3D point-cloud benchmarks for which no canonical label tree exists. *Sample-level* routing is the natural granularity for image and point-cloud inputs, for which token-level feed-forward slicing in transformers does not apply, and it matches the batch-size-one regime of edge inference. Finally, *cross-modal* validation is required to substantiate a framework-level claim rather than an architecture-specific result. To our knowledge, $\Sigma$B-Method is the only method discussed above that simultaneously satisfies all four axes as required by the memory-bound edge setting that motivates this work (Section [I-B](https://arxiv.org/html/2606.09924#S1.SS2)).
## III Sigma-Branch Method
We now describe $\Sigma$B-Method, a framework that restructures a pretrained dense network into a hierarchical, single-path inference network we call $\Sigma$B-Net. The framework consists of four components: a formal specification of the hierarchical binary-tree architecture (Section [III-A](https://arxiv.org/html/2606.09924#S3.SS1)–[III-B](https://arxiv.org/html/2606.09924#S3.SS2)); an activation-based weight distribution procedure that transfers the pretrained weights into the new architecture (Section [III-C](https://arxiv.org/html/2606.09924#S3.SS3)); a soft-routing fine-tuning protocol with a specialist classification loss and a routing responsibility loss (Section [III-D](https://arxiv.org/html/2606.09924#S3.SS4)); and a hard top-1 inference procedure that executes only a single path per input (Section [III-E](https://arxiv.org/html/2606.09924#S3.SS5)). Throughout this section we use a 2-level, 4-leaf instantiation as the canonical example. Binary routing provides a simple recursive decomposition rule compatible with the activation-based spherical $k$-means initialization: each split partitions the feature space into two sub-clusters while keeping the router lightweight and the routing depth logarithmic in the number of leaves. The $(2,4)$ hierarchy used throughout this work is therefore intended as a minimal canonical instantiation rather than a claim of optimal tree size. The framework generalizes naturally to deeper trees, and we instantiate it on two distinct backbones in Section [III-F](https://arxiv.org/html/2606.09924#S3.SS6).
### III-A Framework Overview
$\Sigma$B-Net is a hierarchical binary tree in which a shared backbone is followed by routers, branch-specific specializers, and per-leaf classification heads, as illustrated in Fig. [1](https://arxiv.org/html/2606.09924#S3.F1). The shared backbone $f_{\mathrm{BB}}$ processes every input sample and provides the common feature on which the first router operates. At each level $\ell$, a router projection $\pi_{\ell}$ produces a low-dimensional latent, a binary router $R_{\ell}$ produces a two-way routing distribution, and two specializers $s_{\ell}^{(b)}$ further process the feature conditional on the chosen branch $b \in \{0,1\}$. At the final level $L$, each leaf $k$ has its own classification head $\mu_{k}$. The 2-level, 4-leaf canonical configuration therefore has one backbone, two level-1 specializers, four level-2 specializers, three routers, and four leaf heads.
Every router is binary, so that the activation-based initialization of Section [III-C](https://arxiv.org/html/2606.09924#S3.SS3) applies recursively at each split. The network has two modes of operation: at training time, it evaluates all leaves and combines them through their routing probabilities, while at inference time, it selects a single leaf by hard top-1 routing.
Figure 1:Overall architecture of Sigma\-Branch Net in the 2\-level, 4\-leaf canonical configuration\. The shared backbone is followed by hierarchical binary routers and per\-branch specializers, with a leaf\-specific classification head at every leaf\.
### III-B Hierarchical Binary-Tree Formulation
Let $x$ denote an input sample and $z = f_{\mathrm{BB}}(x)$ the backbone feature. At level $\ell$, the router projection $u_{\ell} = \pi_{\ell}(\cdot)$ consists of an average pool, a linear layer, layer normalization, and dropout, and produces a low-dimensional latent. The binary router applies a linear map followed by softmax,

$$p^{(\ell)} = \mathrm{softmax}\bigl(R_{\ell}(u_{\ell})\bigr) \in \Delta^{1}, \tag{1}$$

where $\Delta^{1}$ denotes the 1-simplex. The probability of reaching leaf $k$ factorizes hierarchically as the product of per-level probabilities,

$$p_{\mathrm{leaf}}(k \mid x) = \prod_{\ell=1}^{L} p^{(\ell)}_{b_{\ell}(k)}, \tag{2}$$

where $b_{\ell}(k) \in \{0,1\}$ denotes the branch chosen at level $\ell$ on the path to leaf $k$. The output of leaf $k$ is obtained by composing the specializers along that path and then applying the leaf head,

$$y_{k} = \mu_{k}\bigl(s_{L}^{(b_{L}(k))} \circ \cdots \circ s_{1}^{(b_{1}(k))}(z)\bigr). \tag{3}$$

At training time, the network output is the routing-weighted combination $\hat{y}_{\mathrm{soft}} = \sum_{k=1}^{K} p_{\mathrm{leaf}}(k \mid x)\, y_{k}$; at inference, only the single leaf with the highest routing probability is evaluated, as detailed in Section [III-E](https://arxiv.org/html/2606.09924#S3.SS5).
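The factorization in Eq. (2) and the soft training-time output can be sketched in a few lines of NumPy for the 2-level, 4-leaf case. This is a minimal illustration, not the authors' implementation: the function names, the toy router logits, and the toy leaf outputs are all hypothetical, and the leaf ordering (00, 01, 10, 11) is an arbitrary convention chosen here.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def leaf_probabilities(logits_l1, logits_l2):
    """Eq. (2) for a 2-level tree: p_leaf(k) = p1[b1] * p2[b1][b2].

    logits_l1: shape (2,), level-1 router output.
    logits_l2: shape (2, 2), row b1 holds the level-2 router output under branch b1.
    Returns p_leaf of shape (4,), ordered (b1, b2) = 00, 01, 10, 11.
    """
    p1 = softmax(np.asarray(logits_l1, dtype=float))
    p2 = np.stack([softmax(row) for row in np.asarray(logits_l2, dtype=float)])
    return (p1[:, None] * p2).reshape(-1)

def soft_output(p_leaf, leaf_logits):
    """Training-time output: routing-weighted sum of the leaf outputs y_k."""
    return p_leaf @ leaf_logits   # (K,) @ (K, C) -> (C,)

# Toy values (hypothetical): router logits and leaf outputs for K=4 leaves, C=3 classes.
p_leaf = leaf_probabilities([2.0, 0.0], [[0.5, 0.0], [1.0, -1.0]])
leaf_logits = np.arange(12, dtype=float).reshape(4, 3)
y_soft = soft_output(p_leaf, leaf_logits)
print(p_leaf, y_soft)
```

Because each level's routing distribution sums to one, the four leaf probabilities also sum to one, so the soft output is a convex combination of the leaf outputs.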
The number of parameters that must be loaded at inference,

$$P_{\mathrm{active}} = |\theta_{\mathrm{BB}}| + \sum_{\ell=1}^{L} \bigl|\theta_{s_{\ell}^{(b_{\ell}(k^{*}))}}\bigr| + |\theta_{\mu_{k^{*}}}|, \tag{4}$$

is therefore proportional to a single root-to-leaf path, while the total parameter count $P_{\mathrm{total}} = |\theta_{\mathrm{BB}}| + 2|\theta_{s_{1}}| + 4|\theta_{s_{2}}| + 4|\theta_{\mu}|$ in the 4-leaf configuration is substantially larger. To prevent $P_{\mathrm{total}}$ from growing beyond the baseline as $n$ parallel branches are introduced at a level, we reduce the channel width of each specializer by a factor of $1/\sqrt{n}$ relative to the baseline block, so that the overall parameter budget remains comparable to the baseline. In our canonical ResNet-50 instantiation, the resulting widths are $512$ for the shared backbone, $\approx 512\sqrt{2} \approx 724$ for each level-1 specializer, and $\approx 724\sqrt{2} \approx 1024$ for each level-2 specializer.
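The bookkeeping behind Eq. (4) and the $1/\sqrt{n}$ width rule can be checked with a short sketch. Two things here are assumptions rather than statements from the paper: the baseline block widths of 1024 and 2048 (the output widths of the last two ResNet-50 stages, with $n = 2$ and $n = 4$ parallel branches respectively), and the per-module parameter counts, which are placeholder values chosen only to exercise the formula. The function names are hypothetical.

```python
import math

def specializer_width(baseline_width: int, n_branches: int) -> int:
    """Width after scaling a baseline block by 1/sqrt(n), rounded to an integer."""
    return round(baseline_width / math.sqrt(n_branches))

def param_counts(p_bb, p_s1, p_s2, p_head):
    """Eq. (4) and P_total for the 2-level, 4-leaf tree; inputs are per-module counts."""
    active = p_bb + p_s1 + p_s2 + p_head
    total = p_bb + 2 * p_s1 + 4 * p_s2 + 4 * p_head
    return active, total

# Assumed baseline widths: 1024 (two level-1 branches) and 2048 (four level-2 branches).
w1 = specializer_width(1024, 2)
w2 = specializer_width(2048, 4)

# Placeholder module sizes, NOT the paper's numbers.
active, total = param_counts(p_bb=1.0e6, p_s1=3.0e6, p_s2=4.0e6, p_head=0.1e6)
reduction = 1.0 - active / total
print(w1, w2, f"active={active:.0f} total={total:.0f} reduction={reduction:.1%}")
```

Under these assumed widths the rule reproduces the 724 and 1024 figures quoted above; the printed reduction depends entirely on the placeholder sizes and carries no empirical meaning.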
### III-C Activation-Based Weight Distribution
Jointly training a router and a set of experts from scratch in MoE\-style architectures is known to suffer from a chicken\-and\-egg dependency, as observed in the original sparsely\-gated formulation by Shazeer et al\.\[[16](https://arxiv.org/html/2606.09924#bib.bib16)\]\. In our setting the same dependency arises between the router and the specializers: the router cannot send samples to the appropriate branches unless the specializers are already specialized for distinct input subsets, while the specializers cannot specialize unless the router consistently feeds them their respective subsets\. Without an external intervention, soft routing combined with all\-leaf training resolves this circularity in the trivial direction in which all branches converge toward similar functions, and branch specialization never emerges\.
We therefore initialize the router and the specializers *jointly*, from a single source of structure, rather than through independent procedures that could leave the router's partition misaligned with the channels each specializer inherits. We eliminate this mismatch by construction: a single spherical $k$-means clustering of the projection latents provides both the router weight, as cluster centroids, and the per-branch channel allocation, via cluster-conditional source-layer activations, so the router's per-sample decision and the specializer's per-cluster channel inventory are aligned from the outset.
Phase 0: backbone transfer. The early generic-feature layers of the pretrained baseline are copied verbatim into $f_{\mathrm{BB}}$. Their shapes are designed to match, so the transfer is a direct parameter copy and preserves the baseline's low-level feature extractor.
Phase 1: level-1 clustering and router initialization. We forward the training set through $f_{\mathrm{BB}}$ and the level-1 projection $\pi_{1}$, collecting up to $N$ latents $\{u_{1}^{(i)}\}_{i=1}^{N}$ with $N \leq 50{,}000$. We then run a spherical $k$-means algorithm with $k=2$ \[[20](https://arxiv.org/html/2606.09924#bib.bib20),[21](https://arxiv.org/html/2606.09924#bib.bib21)\] on these latents to obtain centroids $C_{1} \in \mathbb{R}^{2 \times D}$ and cluster assignments $c_{i} \in \{0,1\}$. Setting $R_{1}.W \leftarrow C_{1}$ and $R_{1}.b \leftarrow 0$ makes $\mathrm{softmax}\bigl(C_{1}\,u_{1}\bigr)$ a cosine-similarity-based binary router whose decision boundary coincides with the $k$-means partition. In parallel, we forward the same samples through the baseline up to the level-1 source block and accumulate the pooled per-channel mean activation $a^{(c)}$ conditioned on each cluster $c$.
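The router initialization of Phase 1 can be illustrated with a small NumPy sketch of spherical $k$-means. This is a generic textbook-style variant written for illustration, not the authors' code: the deterministic farthest-point initialization, the iteration count, and the synthetic latents are assumptions, and `spherical_kmeans` is a hypothetical helper name.

```python
import numpy as np

def spherical_kmeans(latents, k=2, n_iter=50):
    """Cluster row vectors by cosine similarity; returns unit-norm centroids and labels."""
    x = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    # Deterministic farthest-point initialization (an illustrative choice).
    centroids = [x[0]]
    for _ in range(1, k):
        sim = np.max(x @ np.stack(centroids).T, axis=1)
        centroids.append(x[np.argmin(sim)])
    centroids = np.stack(centroids)
    for _ in range(n_iter):
        labels = np.argmax(x @ centroids.T, axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:                 # keep the old centroid if a cluster empties
                mean = members.sum(axis=0)
                centroids[c] = mean / np.linalg.norm(mean)
    labels = np.argmax(x @ centroids.T, axis=1)  # labels consistent with the final centroids
    return centroids, labels

# Synthetic latents (hypothetical): two noisy directions in an 8-dimensional space.
rng = np.random.default_rng(0)
dim = 8
dir_a, dir_b = np.eye(dim)[0], np.eye(dim)[1]
u1 = np.concatenate([
    dir_a + 0.05 * rng.standard_normal((50, dim)),
    dir_b + 0.05 * rng.standard_normal((50, dim)),
])

C1, c = spherical_kmeans(u1, k=2)
router_W, router_b = C1.copy(), np.zeros(2)      # R_1.W <- C_1, R_1.b <- 0
# Softmax is monotone, so the router's hard decision is the argmax of the logits.
router_choice = np.argmax(u1 @ router_W.T + router_b, axis=1)
```

Since scaling a latent by its positive norm does not change which centroid has the larger inner product, the initialized router reproduces the cluster assignments on these samples, which is the alignment the text describes.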
Phase 2: shared and branch-specific channel selection. The level-1 specializer width is $W_{L1}$; we allocate $K_{\mathrm{shared}} = \lfloor \kappa W_{L1} \rfloor$ channels as shared between the two branches and $W_{L1} - K_{\mathrm{shared}}$ as branch-specific, where $\kappa \in [0,1]$ controls the trade-off and is fixed to $0.5$ in our experiments. The shared channels are chosen as the top-$K_{\mathrm{shared}}$ indices of $\min\bigl(a^{(0)}, a^{(1)}\bigr)$, so that a channel qualifies as shared only when it is strongly activated in *both* clusters; this prevents channels that are strong in only one cluster from being routed to the shared pool. The branch-specific channels for cluster $c$ are then chosen from the remaining channels by the contrast score $a^{(c)} - a^{(1-c)}$, capturing features that are relatively more active for that cluster. For convolutional baselines, the bottleneck's internal plane indices are additionally selected by L1-norm and shared across the two branches. The resulting index set defines which baseline channels are copied into $s_{1}^{(A)}$ and $s_{1}^{(B)}$. Fig. [2](https://arxiv.org/html/2606.09924#S3.F2) visualizes the partition for our ResNet-50 instantiation: each baseline channel is plotted as one point in the $(a^{(0)}, a^{(1)})$ plane, with channels near the diagonal becoming shared and the off-diagonal channels becoming branch-specific. The decomposition mirrors the shared-expert pattern of DeepSeekMoE\[[17](https://arxiv.org/html/2606.09924#bib.bib17)\], where always-active shared experts free routed experts to specialize sharply; in our setting, shared channels carry universally useful features while branch-specific channels capture cluster-distinctive ones.
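The selection rule of Phase 2 can be sketched as follows. The function name `select_channels`, the toy activation vectors, and the tie-breaking by stable sort are assumptions made for illustration; the L1-norm selection of internal bottleneck planes mentioned above is not modeled.

```python
import numpy as np

def select_channels(a0, a1, width, kappa=0.5):
    """Pick shared and branch-specific baseline channel indices.

    a0, a1: per-channel mean activations conditioned on cluster 0 and cluster 1.
    width:  specializer width W (number of channels each branch receives).
    """
    a0, a1 = np.asarray(a0, dtype=float), np.asarray(a1, dtype=float)
    k_shared = int(np.floor(kappa * width))
    # Shared: channels that are strongly activated in BOTH clusters.
    order = np.argsort(-np.minimum(a0, a1), kind="stable")
    shared = np.sort(order[:k_shared])
    # Branch-specific: chosen from the remaining channels by contrast score.
    remaining = np.setdiff1d(np.arange(a0.size), shared)
    n_specific = width - k_shared
    specific = {}
    for c, contrast in enumerate((a0 - a1, a1 - a0)):
        top = np.argsort(-contrast[remaining], kind="stable")[:n_specific]
        specific[c] = np.sort(remaining[top])
    return shared, specific

# Toy activations for six baseline channels (hypothetical values).
a0 = [0.90, 0.80, 0.10, 0.70, 0.05, 0.30]
a1 = [0.85, 0.10, 0.90, 0.75, 0.60, 0.20]
shared, specific = select_channels(a0, a1, width=4, kappa=0.5)
out_idx = {c: np.union1d(shared, specific[c]) for c in (0, 1)}
print(shared, specific, out_idx)
```

In this toy case channels 0 and 3 are active in both clusters and become shared, while each branch receives the two remaining channels that favor its own cluster.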
Phases 3–4: level-2 recursion. Using the initialized level-1 components, we forward each sample only through the branch chosen by $R_{1}$, collecting per-branch latents $u_{A}$ and $u_{B}$ at the level-2 projection. Spherical $k$-means with $k=2$ is then applied within each branch to obtain centroids $C_{A}, C_{B}$, which initialize $R_{L2,A}$ and $R_{L2,B}$. Cluster-conditional activations of the level-2 source block are aggregated within each branch, and the shared and branch-specific selection of Phase 2 is repeated to produce the four level-2 specializers $s_{2}^{(A1)}, s_{2}^{(A2)}, s_{2}^{(B1)}, s_{2}^{(B2)}$.
Phase 5: leaf head transfer. Finally, each leaf head $\mu_{k}$ inherits the column subset of the baseline classifier weight that corresponds to its allocated output channels. The classifier bias, which is indexed by output class and does not depend on the input channels, is shared identically across all leaf heads. Each leaf head thus starts from a partial copy of the baseline classifier restricted to the channel subset that its specializer produces, and the four leaves differ only in the columns of the inherited weight matrix.
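As a small illustration of Phase 5, the column slicing can be written in NumPy as below. The helper name `init_leaf_head`, the toy classifier sizes, and the example index sets are hypothetical; the sketch only shows the indexing pattern described in the text.

```python
import numpy as np

def init_leaf_head(fc_weight, fc_bias, out_idx):
    """Return (weight, bias) for one leaf head.

    fc_weight: (num_classes, baseline_channels) baseline classifier weight.
    fc_bias:   (num_classes,) baseline classifier bias, shared by every leaf.
    out_idx:   baseline channel indices allocated to this leaf's final specializer.
    """
    return fc_weight[:, out_idx].copy(), fc_bias.copy()

num_classes, baseline_channels = 5, 8        # toy sizes, not the paper's
rng = np.random.default_rng(0)
fc_w = rng.standard_normal((num_classes, baseline_channels))
fc_b = rng.standard_normal(num_classes)

head_a_w, head_a_b = init_leaf_head(fc_w, fc_b, [0, 1, 3, 5])
head_b_w, head_b_b = init_leaf_head(fc_w, fc_b, [0, 2, 3, 4])
print(head_a_w.shape, head_b_w.shape)
```

The two heads share their bias and any columns for shared channels (here indices 0 and 3) and differ only in the branch-specific columns.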
Geometrically, the procedure performs activation-based hierarchical partitioning of the input distribution: each level partitions the remaining feature space by spherical $k$-means and inherits the baseline channels that are most active on the resulting clusters. The router projection $\pi_{\ell}$ is kept close to linear during initialization by omitting the inner non-linearity, so that the spherical structure of the $k$-means cluster centroids in projection space carries over directly to the router weights.
Figure 2: Shared and branch-specific channel selection on the level-1 source block of a pretrained ResNet-50 over ImageNet-1k. Each point is one of the $1024$ output channels of the source layer, plotted by its cluster-conditional mean activations $a^{(0)}$ and $a^{(1)}$. Channels with large $\min(a^{(0)}, a^{(1)})$, i.e. those strongly activated in both clusters, are selected as shared; the remaining channels with large $|a^{(0)} - a^{(1)}|$ are selected as branch-specific. The axes are clipped at $0.15$ so that the bulk of the channel distribution is visible; a small number of high-activation outliers in the shared set fall outside the panel.

Algorithm 1: Activation-based weight distribution for Sigma-Branch Net

1. Input: pretrained baseline $\mathcal{M}_{\mathrm{base}}$; training set $\mathcal{D}$; shared-channel ratio $\kappa$; depth $L$
2. Output: initialized $\Sigma$B-Net parameters
3. $f_{\mathrm{BB}}.\theta \leftarrow$ shared-trunk weights of $\mathcal{M}_{\mathrm{base}}$ $\triangleright$ Phase 0
4. for $\ell = 1$ to $L$ do
5. Collect latents $\{u_{\ell}^{(i)} = \pi_{\ell}(z_{\ell}^{(i)})\}_{i=1}^{N}$ over $\mathcal{D}$
6. $(C_{\ell},\, c_{i}) \leftarrow \mathrm{SphericalKMeans}(\{u_{\ell}^{(i)}\},\, k=2)$ $\triangleright$ Phase $\ell$-a
7. $R_{\ell}.W \leftarrow C_{\ell},\; R_{\ell}.b \leftarrow 0$
8. for branch $b \in \{0,1\}$ at level $\ell$ do
9. $a^{(c)} \leftarrow$ cluster-conditional pooled activation of baseline level-$\ell$ source layer
10. $\mathrm{shared} \leftarrow \arg\mathrm{top}_{\lfloor \kappa W_{\ell} \rfloor}\bigl(\min(a^{(0)}, a^{(1)})\bigr)$
11. $\mathrm{specific}_{b} \leftarrow \arg\mathrm{top}_{W_{\ell} - \lfloor \kappa W_{\ell} \rfloor}\bigl(a^{(b)} - a^{(1-b)}\bigr)$
12. $\mathrm{out\_idx}_{b} \leftarrow \mathrm{shared} \cup \mathrm{specific}_{b}$
13. Copy baseline level-$\ell$ block channels indexed by $\mathrm{out\_idx}_{b}$ into $s_{\ell}^{(b)}$ $\triangleright$ Phase $\ell$-b
14. end for
15. end for
16. for each leaf $k$ do
17. $\mu_{k}.\mathrm{fc} \leftarrow \mathcal{M}_{\mathrm{base}}.\mathrm{fc}[:,\, \mathrm{out\_idx}_{k}]$ $\triangleright$ Phase $L+1$
18. end for
### III-D Soft-Routing Fine-Tuning
After activation-based initialization, the backbone, routers, specializers, and leaf heads are all initialized from the pretrained baseline; only the router projections $\pi_{\ell}$ have no baseline counterpart and remain randomly initialized. The router projections are nevertheless used during the spherical $k$-means step of Phases 1 and 3, and the resulting cluster structure is preserved because spherical $k$-means depends only on angular similarity, which random linear projections approximately preserve in expectation\[[22](https://arxiv.org/html/2606.09924#bib.bib22)\]. After initialization, all parameters are jointly fine-tuned with soft routing, in which every leaf is evaluated for every input, and the losses are combined through the routing probabilities.
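The angle-preservation property invoked above can be checked numerically with a short sketch. It models the projection as a single Gaussian random linear map, which is an assumption: the actual $\pi_{\ell}$ also includes pooling, layer normalization, and dropout, and the dimensions and seed below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 2048, 256                       # illustrative input and latent sizes

# Two unit vectors with a known cosine similarity of 0.8.
u = rng.standard_normal(D)
u /= np.linalg.norm(u)
w = rng.standard_normal(D)
w -= (w @ u) * u                       # make w orthogonal to u
w /= np.linalg.norm(w)
v = 0.8 * u + 0.6 * w

P = rng.standard_normal((d, D)) / np.sqrt(d)   # random Gaussian projection

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cos_before = cosine(u, v)
cos_after = cosine(P @ u, P @ v)
print(f"cosine before: {cos_before:.3f}, after projection: {cos_after:.3f}")
```

The projected cosine is a noisy estimate of the original one, with the spread shrinking as the latent dimension grows, which is why cosine-based clustering in the projected space remains meaningful under this simplified model.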
We use a *specialist* classification loss in which the leaf-wise cross-entropy losses are weighted by the routing probabilities $p_{\mathrm{leaf}}$:

$$\mathcal{L}_{\mathrm{cls}}=\mathbb{E}_{x}\!\left[\sum_{k=1}^{K}p_{\mathrm{leaf}}(k\mid x)\,\mathrm{CE}\!\bigl(y_{k}(x),\,y\bigr)\right]. \tag{5}$$

This formulation differs from a naive ensemble cross-entropy on the weighted-average logits in that each leaf must itself produce a correct prediction when its routing probability is high, which directly aligns the soft-routing training objective with the hard top-1 inference behavior at test time.
To further align the router decision with the per-leaf prediction quality, we augment the objective with a responsibility-matching loss adapted from the gating literature. The responsibility $r_{k}$ of leaf $k$ is defined by the soft minimum of the leaf-wise cross-entropy losses,

$$r_{k}(x)=\frac{\exp\!\bigl(-\mathrm{CE}(y_{k}(x),y)/\tau_{r}\bigr)}{\sum_{j}\exp\!\bigl(-\mathrm{CE}(y_{j}(x),y)/\tau_{r}\bigr)}, \tag{6}$$

and is treated as a stop-gradient target. The routing probabilities are trained to match this responsibility distribution through a cross-entropy penalty,

$$\mathcal{L}_{\mathrm{resp}}=-\,\mathbb{E}_{x}\!\left[\sum_{k}r_{k}(x)\,\log p_{\mathrm{leaf}}(k\mid x)\right]. \tag{7}$$

The total training objective is

$$\mathcal{L}=\mathcal{L}_{\mathrm{cls}}+\lambda_{\mathrm{resp}}\,\mathcal{L}_{\mathrm{resp}}, \tag{8}$$

where $\lambda_{\mathrm{resp}}$ is a fixed scalar coefficient and $\tau_{r}$ is a fixed scalar that controls the sharpness of the responsibility distribution. We set them to $\lambda_{\mathrm{resp}}=0.3$ and $\tau_{r}=0.3$.
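For a single sample, the objective in (5)–(8) can be sketched as below. This is a forward-only illustration in plain Python rather than the authors' implementation: the function names are hypothetical, and in an autograd framework the responsibilities would additionally be detached so that no gradient flows through them.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy of one sample given raw class logits."""
    return -math.log(softmax(logits)[label])

def sigma_b_loss(leaf_logits, p_leaf, label, lam_resp=0.3, tau_r=0.3):
    """Per-sample value of Eq. (8): L_cls + lam_resp * L_resp.

    leaf_logits: one list of class logits per leaf (K leaves).
    p_leaf:      routing probabilities over the K leaves (positive, summing to 1).
    """
    ce = [cross_entropy(logits, label) for logits in leaf_logits]
    # Eq. (5): leaf-wise cross-entropy weighted by the routing probability.
    l_cls = sum(p * c for p, c in zip(p_leaf, ce))
    # Eq. (6): responsibilities are a softmax of -CE / tau_r (treated as constants).
    resp = softmax([-c / tau_r for c in ce])
    # Eq. (7): cross-entropy between responsibilities and routing probabilities.
    l_resp = -sum(r * math.log(p) for r, p in zip(resp, p_leaf))
    return l_cls + lam_resp * l_resp, l_cls, l_resp, resp
```

With four leaves of which only the first predicts the label confidently, a router that puts most of its mass on that leaf receives a lower total loss than one that favors a poorly performing leaf, which is the alignment the responsibility term is meant to encourage.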
### III-E Hard Top-1 Inference
At inference, $\Sigma$B-Net evaluates only a single root-to-leaf path. Starting from the backbone feature $z=f_{\mathrm{BB}}(x)$, the level-1 router selects a branch by hard argmax,

$$b_{\ell}^{*}=\arg\max_{j\in\{0,1\}}\,p^{(\ell)}_{j}, \tag{9}$$

and only the chosen specializer $s_{\ell}^{(b_{\ell}^{*})}$ is executed. The selection is sequential rather than joint: the level-2 projection input is the output of the chosen level-1 specializer, so the level-2 routing distribution exists only along the branch chosen at level 1. Once the leaf $k^{*}$ corresponding to the sequence of argmax decisions is determined, the prediction is taken directly from the selected leaf head, $\hat{y}=y_{k^{*}}$.
The active-parameter count of one inference is therefore $P_{\mathrm{active}}$ from \([4](https://arxiv.org/html/2606.09924#S3.E4)\), comprising the backbone, one specializer per level, and one leaf head. In the 4-leaf configuration, three of the four leaf heads, three of the four level-2 specializers, and one of the two level-1 specializers are entirely skipped.
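The sequential selection described above can be sketched as a short routine. It is a minimal illustration under stated assumptions, not the authors' code: modules are arbitrary callables, the dictionaries keyed by the path taken so far are a hypothetical data layout, and one router per internal tree node is assumed.

```python
def hard_route(x, backbone, routers, specializers, heads, depth=2):
    """Execute one root-to-leaf path of a binary tree with hard argmax routing.

    routers:      maps the path so far, e.g. () or (0,), to a callable that
                  returns two branch scores for the current feature.
    specializers: maps the path including the new branch, e.g. (0,) or (0, 1),
                  to the specializer executed on that branch.
    heads:        maps a full leaf path, e.g. (0, 1), to its leaf head.
    """
    z = backbone(x)
    path = ()
    for _ in range(depth):
        scores = routers[path](z)                    # defined only on the chosen branch
        branch = 0 if scores[0] >= scores[1] else 1  # hard argmax, Eq. (9)
        path = path + (branch,)
        z = specializers[path](z)                    # sibling specializer is never run
    return heads[path](z), path
```

Because the level-2 router is looked up by the level-1 decision, the modules on the unselected branches are never touched, so their weights would not need to be loaded for that input.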
### III-F Modality-Agnostic Applicability
The only assumption the framework places on the baseline is that $f_{\mathrm{BB}}$ produces a pooled global feature on which the router projection $\pi_{\ell}$ can operate; the spherical $k$-means clustering, the cluster-conditional channel selection, and the per-leaf path selection are otherwise modality-independent. The framework therefore extends to any such backbone, including convolutional networks (global average pooling) and PointNet++-style networks (set-abstraction descriptors). Per-modality choices of source block, specializer width, and training schedule are reported in Section [IV](https://arxiv.org/html/2606.09924#S4).
## IV Experimental Setup
### IV-A Hardware and Software Environment
All models are trained on NVIDIA A100 GPUs. The implementation uses PyTorch with the timm training utilities.
### IV-B Models and Datasets
We instantiate $\Sigma$B-Method on three backbone–dataset combinations, covering image classification (ResNet-50 on CIFAR-100, ResNet-50 on ImageNet-1K) and point-cloud classification (PointNet++ on ModelNet40).
#### IV-B1 CIFAR-100 with ResNet-50
CIFAR-100 \[[23](https://arxiv.org/html/2606.09924#bib.bib23)\] contains 50,000 training and 10,000 test images at $32\times 32$ resolution across 100 classes. The baseline is a timm ResNet-50 \[[1](https://arxiv.org/html/2606.09924#bib.bib1)\] with the CIFAR-style modification of a $3\times 3$ stride-1 stem and no max-pool. The $\Sigma$B-Net counterpart uses the stem and layers 1–2 as the shared trunk (backbone width 512), six bottleneck blocks per level-1 specializer at width 724 ($\approx 512\sqrt{2}$), and three blocks per level-2 specializer at width 1024 ($\approx 724\sqrt{2}$). All four leaves share the same width and depth, so $P_{\mathrm{active}}$ is path-invariant. The projection dimension used by the spherical $k$-means router is 128.
#### IV-B2 ImageNet-1K with ResNet-50
ImageNet-1K \[[24](https://arxiv.org/html/2606.09924#bib.bib24)\] consists of 1,281,167 training and 50,000 validation images across 1,000 classes. The baseline is a timm ResNet-50 with the standard $7\times 7$ stride-2 stem followed by max-pool. The $\Sigma$B-Net uses the same specializer widths as the CIFAR-100 instantiation; only the stem is replaced with the ImageNet variant.
#### IV-B3 ModelNet40 with PointNet++
ModelNet40 \[[25](https://arxiv.org/html/2606.09924#bib.bib25)\] consists of 9,843 training and 2,468 test point clouds across 40 classes, uniformly sampled to 1,024 points per shape. The baseline is the PointNet++ \[[7](https://arxiv.org/html/2606.09924#bib.bib7)\] single-scale-grouping variant. The $\Sigma$B-Net uses SA1 and SA2 as the shared trunk (global feature dimension 256), an SA3-equivalent ball-query MLP at width 512 for each level-1 specializer, an FC1-equivalent MLP at width 256 for each level-2 specializer, and an FC2-equivalent head per leaf.
### IV-C Training Protocol
Table [II](https://arxiv.org/html/2606.09924#S4.T2) summarizes the training recipe for each combination. For image classification, we follow the timm-style training protocol: SGD with Nesterov momentum, an initial learning rate of $0.1$, cosine annealing \[[26](https://arxiv.org/html/2606.09924#bib.bib26)\] with five warm-up epochs, and label smoothing of $0.1$. CIFAR-100 runs for 200 epochs with weight decay $10^{-4}$; ImageNet-1K runs for 90 epochs. For point-cloud classification we follow the PointNet++ recipe: Adam \[[27](https://arxiv.org/html/2606.09924#bib.bib27)\] with initial learning rate $10^{-3}$, weight decay $10^{-4}$, and step decay ($\gamma=0.7$ every 20 epochs) for 200 epochs.
TABLE II: Training hyper-parameters per backbone–dataset combination.
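The two learning-rate schedules in the recipe above can be written as plain functions of the epoch index. The step-decay rule follows the stated $\gamma=0.7$ every 20 epochs. For the image recipe, the linear shape of the warm-up and the decay toward zero at the end of training are assumptions, since the text gives only the warm-up length and the cosine form; the function names are hypothetical.

```python
import math

def cosine_lr(epoch, total_epochs, base_lr=0.1, warmup_epochs=5):
    """Cosine annealing preceded by a warm-up (assumed linear, assumed floor of 0)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def step_lr(epoch, base_lr=1e-3, gamma=0.7, step=20):
    """Step decay: multiply the rate by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Image recipe (200 epochs for CIFAR-100) and point-cloud recipe (200 epochs).
image_schedule = [cosine_lr(e, 200) for e in range(200)]
point_schedule = [step_lr(e) for e in range(200)]
```

In a PyTorch training loop the same behavior would normally come from a scheduler object; the closed forms are shown here only to make the recipe concrete.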
### IV-D Baselines for Comparison
We organize the comparison around three baseline families, each chosen to isolate a specific axis of the trade-off space rather than to claim head-to-head dominance on every axis.
#### IV-D1 Original models (D-1)
The first comparison is against the unconverted backbones themselves: ResNet-50 \[[1](https://arxiv.org/html/2606.09924#bib.bib1)\] on CIFAR-100 and ImageNet-1K, and PointNet++ \[[7](https://arxiv.org/html/2606.09924#bib.bib7)\] on ModelNet40. Each baseline is retrained in our environment under the protocol of Table [II](https://arxiv.org/html/2606.09924#S4.T2), and we report accuracy, FLOPs, and active parameters before and after $\Sigma$B-Method is applied. This establishes whether the reconstructed network preserves the predictive quality of its source model.
#### IV-D2 Static filter pruning (D-2)
We compare against two representative static filter-pruning methods on ResNet-50: FPGM \[[13](https://arxiv.org/html/2606.09924#bib.bib13)\] and HRank \[[14](https://arxiv.org/html/2606.09924#bib.bib14)\], both of which report ImageNet-1K ResNet-50 numbers at multiple compression ratios from their own dense baseline. This baseline characterizes the static FLOPs-reduction frontier against which the dynamic-path approach is positioned.
#### IV-D3 Sample-level dynamic computation (D-3)
We compare against SkipNet \[[12](https://arxiv.org/html/2606.09924#bib.bib12)\], which augments a ResNet with recurrent gating modules that decide, per input, whether each residual block is executed or skipped, and against the narrow variant of DeepMoE \[[11](https://arxiv.org/html/2606.09924#bib.bib11)\], which gates individual output channels of each convolution via a shallow embedding network. Both methods share with $\Sigma$B-Net the property of input-conditional path selection, but differ in granularity: SkipNet operates at the block level with stochastic binary gates, DeepMoE at the channel level with continuous ReLU gates, and $\Sigma$B-Net at the leaf level with hard top-1 routing through a hierarchical tree.
Two further entries appearing in the positioning table (Table [I](https://arxiv.org/html/2606.09924#S2.T1)), DecisioNet \[[19](https://arxiv.org/html/2606.09924#bib.bib19)\] and the analytical FFN-to-MoE conversion of \[[18](https://arxiv.org/html/2606.09924#bib.bib18)\], are retained as related work but excluded from numerical comparison. DecisioNet builds a hierarchical binary tree of specialized paths, but trains the network from scratch using a *supervised* label-derived class hierarchy; the $\Sigma$B-Method, by contrast, fine-tunes a pretrained dense model using an *unsupervised* activation-based partition (Section [III-C](https://arxiv.org/html/2606.09924#S3.SS3)). A head-to-head comparison would therefore conflate the training regime with the architectural design. The analytical FFN-to-MoE method targets the feed-forward sub-layers of a pretrained Transformer with no retraining, and does not apply to the convolutional or point-cloud backbones that are within the scope of our experiments. Both methods remain in the related-work positioning (Section [II](https://arxiv.org/html/2606.09924#S2)) to clarify where $\Sigma$B-Method sits in the design space.
### IV-E Evaluation Metrics
For each method and combination, we report the following metrics.
(i) Top-1 and Top-5 accuracy on the held-out test set, or the validation set for ImageNet-1K.
(ii) Active parameters $P_{\mathrm{active}}$ as defined in Equation \([4](https://arxiv.org/html/2606.09924#S3.E4)\), counted along the hard-routed path. The active-parameter metric is intended as an analytical proxy for per-inference off-chip weight-transfer volume under memory-bound deployment, rather than as a direct measurement of wall-clock latency.
(iii) FLOPs, measured with fvcore on a forward pass under hard top-1 routing with batch size one, so that the data-dependent routing short-circuits to a single executed path.
(iv) Routing distribution, a qualitative analysis of which leaves are selected for each class. The per-class leaf usage is visualized as a heatmap and serves as an interpretability check on the unsupervised partition rather than a quantitative score.
Across (i)–(iii), the three baseline families differ in how they occupy the FLOPs–$P_{\mathrm{active}}$ plane: static pruning compresses both axes by similar fractions but does so permanently, sample-level dynamic computation reduces both axes through input-dependent soft gating, and $\Sigma$B-Method reduces both axes through input-dependent hard top-1 routing while retaining the full dense parameter set. Section [V](https://arxiv.org/html/2606.09924#S5) reports the corresponding numerical comparison.
## V Results
### V-A Accuracy and Compute Trade-off across Tasks
Table [III](https://arxiv.org/html/2606.09924#S5.T3) summarizes the main results across the three backbone–dataset combinations. The primary observation is that $\Sigma$B-Net reduces the active-parameter footprint $P_{\mathrm{active}}$ by 58–60 % relative to its dense baseline while remaining either within 0.1 pp of the baseline accuracy (CIFAR-100), within 1.72 pp (ImageNet-1K), or above it (ModelNet40). FLOPs at hard top-1 inference are reduced by 10–32 %, with the variation reflecting how much of the network is shared across all inputs in each backbone.
TABLE III: Main results across the three backbone–dataset combinations. Active param. $\downarrow$ is the reduction in $P_{\mathrm{active}}$ relative to the dense baseline.
On CIFAR-100 with ResNet-50, the $\Sigma$B-Net obtained from a 60-epoch fine-tune of the 200-epoch dense baseline retains Top-1 within 0.07 pp (76.99 % $\to$ 76.92 %) while reducing $P_{\mathrm{active}}$ by 60.3 %. Top-5 improves by 0.65 pp. Hard-routed FLOPs drop by 32.0 %.
On ImageNet-1K with ResNet-50, the $\Sigma$B-Net is fine-tuned for 100 epochs from a 90-epoch dense baseline. Top-1 drops by 1.72 pp (76.54 % $\to$ 74.82 %) while $P_{\mathrm{active}}$ falls by 59.5 % and FLOPs by 31.1 %.
On ModelNet40 with PointNet++, the $\Sigma$B-Net improves Top-1 by 1.10 pp (90.15 % $\to$ 91.25 %) while reducing $P_{\mathrm{active}}$ by 58.3 %. The FLOPs reduction is modest (10.3 %) because the shared trunk SA1 + SA2, which is executed for every input, dominates the per-sample compute on point clouds.
### V-B Comparison with Static Structured Pruning
Table [IV](https://arxiv.org/html/2606.09924#S5.T4) compares $\Sigma$B-Net on ImageNet-1K ResNet-50 against two representative static filter-pruning methods that report results on their own dense baseline: FPGM \[[13](https://arxiv.org/html/2606.09924#bib.bib13)\] at two pruning rates, and HRank \[[14](https://arxiv.org/html/2606.09924#bib.bib14)\] at three sparsity levels. We use the numbers reported in the original publications.
TABLE IV: Comparison with static structured pruning on ImageNet-1K ResNet-50. § marks analytic values derived from the published FPGM pruning rule.
At a comparable Top-1 (74.8–75.0 %), $\Sigma$B-Net's $P_{\mathrm{active}}$ reduction is 14–23 pp larger than that of the static methods: 59.5 % versus 43.0 % for FPGM-40 (Top-1 74.83 %) and 36.7 % for HRank-44 (Top-1 74.98 %). On the FLOPs axis the trade-off is reversed: at the same Top-1 band, the static methods remove 42.2–53.5 % of FLOPs while $\Sigma$B-Net removes 31.1 %. The implications of this asymmetric position on the active-parameter–FLOPs plane are discussed in Section [VI-A](https://arxiv.org/html/2606.09924#S6.SS1).
### V-C Comparison with Sample-Level Dynamic Computation
We compare $\Sigma$B-Net against two representative sample-level dynamic-inference methods on ImageNet-1K ResNet-50: SkipNet \[[12](https://arxiv.org/html/2606.09924#bib.bib12)\], which selects per-input residual blocks to skip, and the narrow variant of DeepMoE \[[11](https://arxiv.org/html/2606.09924#bib.bib11)\], which gates output channels per input via a shallow embedding network. Table [V](https://arxiv.org/html/2606.09924#S5.T5) reports the comparison; active-parameter and FLOPs reductions are computed per input from gate masks on the ImageNet-1K validation set and averaged.
TABLE V: Comparison with sample-level dynamic computation on ImageNet-1K ResNet-50.
$\Sigma$B-Net reports both the highest Top-1 and the largest active-parameter reduction in Table [V](https://arxiv.org/html/2606.09924#S5.T5). The structural distinction is that each input selects exactly one root-to-leaf path of $\Sigma$B-Net, fixing the active-parameter mass by architecture rather than by trained gate values; SkipNet's per-block Bernoulli gating and DeepMoE's continuous channel gating both produce sample-dependent active footprints, which complicate deterministic resource provisioning on edge accelerators.
### V-D Routing Behavior
We inspect the learned routing on the CIFAR-100 $\Sigma$B-Net. The validation-set leaf utilization under hard top-1 routing is $(A_{1},A_{2},B_{1},B_{2})=(20.0,23.6,13.9,42.6)\,\%$, giving a balance score of 0.935 on the $[0,1]$ scale where one denotes the uniform distribution and zero denotes complete collapse to a single leaf. The routers are well separated, and no leaf is starved.
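A balance score of this kind can be computed as the entropy of the leaf-usage histogram divided by its maximum, which matches the description of Balance as entropy-normalized leaf utilization in the caption of Table VI. The sketch below assumes natural-log entropy normalized by $\log K$; the function name is hypothetical, and the exact definition used for the reported figure is not spelled out in this section.

```python
import math

def balance_score(utilization):
    """Entropy of the leaf-usage distribution divided by log(K).

    Returns 1.0 for uniform usage and 0.0 when every sample goes to one leaf.
    """
    total = sum(utilization)
    probs = [u / total for u in utilization]   # renormalize rounded percentages
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))

# Leaf utilization (A1, A2, B1, B2) in percent, as reported in the text.
score = balance_score([20.0, 23.6, 13.9, 42.6])
print(f"balance = {score:.3f}")
```

Applied to the rounded percentages above, this yields a value of roughly 0.94, in the neighborhood of the reported 0.935; any remaining difference may stem from rounding of the percentages or from details of the definition that are not stated here.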
Fig. [3](https://arxiv.org/html/2606.09924#S5.F3) shows the per-class leaf-assignment distribution: for each of the 100 CIFAR-100 classes, we report the fraction of validation samples routed to each leaf under hard top-1. The heatmap exhibits a clear block structure: distinct subsets of classes concentrate on different leaves, and most classes route into a single dominant leaf rather than distributing uniformly. This is the qualitative signature predicted by the activation-based initialization in Section [III-C](https://arxiv.org/html/2606.09924#S3.SS3): by construction, the leaves are seeded from cluster-conditional channel subsets, and the responsibility loss in Section [III-D](https://arxiv.org/html/2606.09924#S3.SS4) preserves this clustering through fine-tuning.
Figure 3: Per-class leaf-assignment heatmap for Sigma-Branch Net on CIFAR-100 (seed-42 reproduction of the main-table $\Sigma$B-Net configuration). Rows are the 100 CIFAR-100 classes (re-ordered by dominant leaf); columns are the four leaves $(A_{1},A_{2},B_{1},B_{2})$; cell intensity is the fraction of validation samples of that class routed to that leaf under hard top-1 routing.
### V-E Ablations
We ablate the two non-trivial design choices in Section [III](https://arxiv.org/html/2606.09924#S3): the activation-based channel partition used by the pretrained-baseline initialization (Section [III-C](https://arxiv.org/html/2606.09924#S3.SS3)) and the routing-responsibility loss (Section [III-D](https://arxiv.org/html/2606.09924#S3.SS4)). All ablations are run on ImageNet-1K with ResNet-50 under the same fine-tuning recipe as the main result. The default configuration matches the main-results row.
TABLE VI: Ablations on ImageNet-1K ResNet-50. Balance is the entropy-normalized leaf utilization (one = uniform).
The two ablations expose complementary roles of the activation-based channel partition and the routing-responsibility loss. Randomizing the channel partition while keeping every other component of the initialization pipeline intact makes the two L1 specializers functionally similar: their output channels are now drawn uniformly from the same pool, so neither specializer develops a cluster-conditional representation that the L1 router can exploit. The L1 router consequently collapses to a single branch and the network degenerates into a two-leaf tree, with a Top-1 drop of $0.87$ pp on ImageNet-1K. Setting $\lambda_{\mathrm{resp}}=0$ removes the explicit incentive for the L1 router to use both branches: $A_{1}$ and $A_{2}$ each receive 0 % of validation samples, and the accuracy drops by $1.48$ pp.
## VI Discussion
### VI-A Position on the Active-Parameter / Accuracy Trade-off
$\Sigma$B-Net occupies a distinct operating point on the multi-axis efficiency Pareto front, dominating the active-parameter axis while offering only moderate FLOPs reduction (Sections [V-A](https://arxiv.org/html/2606.09924#S5.SS1)–[V-C](https://arxiv.org/html/2606.09924#S5.SS3)). The active-parameter advantage is structural rather than incidental: static structured pruning removes filters permanently to compress both axes, and sample-level dynamic computation gates entire blocks to skip compute, whereas $\Sigma$B-Net reduces the active-parameter footprint via per-inference path selection over a hierarchical binary tree, with the dense total parameter count preserved. DecisioNet \[[19](https://arxiv.org/html/2606.09924#bib.bib19)\] is the structurally closest prior work, but its supervised label-tree construction is restricted to convolutional backbones.
### VI-B Implications for Memory-Bound Deployment
The distinction between FLOPs and active parameters is central in the memory-bound regime. FLOPs measure arithmetic work, whereas $P_{\mathrm{active}}$ measures the amount of model state that must be resident on-chip or transferred from off-chip memory for a single inference. When the dense model does not fit in on-chip memory and inference is performed at batch size one, weights cannot be amortized across a large batch, and off-chip weight traffic can dominate latency even when arithmetic units are underutilized. Thus, reducing $P_{\mathrm{active}}$ targets a different bottleneck from conventional FLOPs reduction.
The active-parameter advantage translates analytically into reduced per-inference DRAM read traffic on memory-constrained accelerators, where on-chip memory cannot hold the dense network, and per-inference latency is governed by off-chip weight transfer under the batch-size-one regime typical of edge inference \[[8](https://arxiv.org/html/2606.09924#bib.bib8)\]. Because only the backbone, routers, and a single leaf execute, the transferred volume scales with $P_{\mathrm{active}}$; an active-parameter reduction of approximately 60 % therefore reduces analytical off-chip transfer by approximately 60 %, corresponding under FP32 storage to roughly 41 MB per inference for the ResNet-50 $\Sigma$B-Net versus 102 MB for the dense baseline. Unlike static structured pruning, which achieves a comparable reduction only at the cost of permanently lowered capacity \[[15](https://arxiv.org/html/2606.09924#bib.bib15)\], $\Sigma$B-Net decouples per-inference memory traffic from the total parameter count.
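The transfer figures quoted above follow from simple arithmetic, sketched here as a hedged back-of-the-envelope check. The parameter count of about 25.6 million for ImageNet ResNet-50 is an assumption not stated in this section, the 59.5 % reduction is the ImageNet-1K value from Section V-A, and the helper name is hypothetical.

```python
def transfer_mb(num_params, active_fraction=1.0, bytes_per_param=4):
    """Weight bytes read per inference, in MB (10**6 bytes), for FP32 by default."""
    return num_params * active_fraction * bytes_per_param / 1e6

RESNET50_PARAMS = 25.6e6          # assumed approximate ImageNet ResNet-50 size
ACTIVE_REDUCTION = 0.595          # ImageNet-1K reduction reported in Section V-A

dense_mb = transfer_mb(RESNET50_PARAMS)                           # about 102 MB
active_mb = transfer_mb(RESNET50_PARAMS, 1.0 - ACTIVE_REDUCTION)  # about 41 MB
print(f"dense: {dense_mb:.1f} MB, single path: {active_mb:.1f} MB")
```

The same helper makes it easy to see how the analytical volume would change under other storage formats, for example by setting `bytes_per_param=2` for 16-bit weights.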
Realizing this analytical reduction as wall\-clock latency improvement is non\-trivial: naive on\-demand fetching after the router’s argmax decision breaks the prefetch–compute overlap that hides memory latency, and selective weight transfer requires runtime support together with a branch\-aligned memory layout that preserves burst\-transfer efficiency\. These implementation aspects are deferred to future work \(Section[VI\-D](https://arxiv.org/html/2606.09924#S6.SS4)\)\. The present work isolates the algorithmic contribution of active\-parameter reduction from hardware\-specific runtime co\-design, which we intentionally leave outside the scope of this paper\.
### VI-C Limitations
Three limitations point to directions of future work\. First, a non\-trivial accuracy gap remains on ImageNet\-1K \(Section[V\-A](https://arxiv.org/html/2606.09924#S5.SS1)\); a plausible cause is that the four\-leaf structure does not fully absorb the intra\-class variability of 1000 classes\. Second, the modality scope does not yet cover Vision Transformer\[[2](https://arxiv.org/html/2606.09924#bib.bib2)\]architectures, which a framework\-level generality claim would ultimately require\. Third, the reductions reported here are analytical; empirical wall\-clock validation on memory\-constrained accelerators is left for future work\.
### VI-D Future Work
Three directions follow from the limitations above. First, extension to Vision Transformer backbones via a hybrid approach (dense self-attention with hierarchically restructured feed-forward sub-layers) would broaden modality coverage; recent analytical FFN-to-MoE conversion \[[18](https://arxiv.org/html/2606.09924#bib.bib18)\] offers a flat, training-free counterpart against which a hierarchical fine-tuned variant could be compared. Second, the fixed $(2,4)$ tree configuration could be replaced with a data-driven choice via activation-clustering criteria (e.g., silhouette or gap statistics) or small-scale architecture search, better fitting larger label spaces and asymmetric input distributions. Third, hardware-level realization on memory-constrained accelerators is required to convert the analytical traffic reduction into wall-clock latency improvement, involving benchmarking on accelerators with explicit on-chip / off-chip hierarchies, runtime support for selective weight prefetching aligned with router decisions, and quantization-aware leaf placement to maximize on-chip residency.
## VII Conclusion
We introduced $\Sigma$B-Method, a framework that converts a pretrained dense network into a hierarchical binary tree executed at inference as a single hard top-1 path. Across CIFAR-100 / ResNet-50, ImageNet-1K / ResNet-50, and ModelNet40 / PointNet++, $\Sigma$B-Net reduces the per-inference active-parameter footprint by 58–60 % while preserving classification accuracy within 1.72 pp of the dense baseline. By retaining the full dense parameter set, $\Sigma$B-Method decouples per-inference memory traffic from the total parameter count, occupying a distinct operating point relative to static pruning and sample-level dynamic computation, and the cross-modal evaluation substantiates a framework-level claim.
## References
- \[1\]K\. He, X\. Zhang, S\. Ren, and J\. Sun, “Deep residual learning for image recognition,” in*Proc\. IEEE Conf\. Comput\. Vis\. Pattern Recognit\. \(CVPR\)*, 2016, pp\. 770–778\.
- \[2\]A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, J\. Uszkoreit, and N\. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in*Proc\. Int\. Conf\. Learn\. Represent\. \(ICLR\)*, 2021\.
- \[3\]J\. Chen and X\. Ran, “Deep Learning With Edge Computing: A Review,”*Proceedings of the IEEE*, vol\. 107, no\. 8, pp\. 1655–1674, Aug\. 2019\.
- \[4\]A\. G\. Howard, M\. Zhu, B\. Chen, D\. Kalenichenko, W\. Wang, T\. Weyand, M\. Andreetto, and H\. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,”*arXiv preprint arXiv:1704\.04861*, 2017\.
- \[5\]Z\. Liu, H\. Mao, C\.\-Y\. Wu, C\. Feichtenhofer, T\. Darrell, and S\. Xie, “A ConvNet for the 2020s,” in*Proc\. IEEE/CVF Conf\. Comput\. Vis\. Pattern Recognit\. \(CVPR\)*, 2022, pp\. 11 976–11 986\.
- \[6\]C\. R\. Qi, H\. Su, K\. Mo, and L\. J\. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in*Proc\. IEEE Conf\. Comput\. Vis\. Pattern Recognit\. \(CVPR\)*, 2017, pp\. 652–660\.
- \[7\]C\. R\. Qi, L\. Yi, H\. Su, and L\. J\. Guibas, “PointNet\+\+: Deep hierarchical feature learning on point sets in a metric space,” in*Proc\. Adv\. Neural Inf\. Process\. Syst\. \(NeurIPS\)*, 2017, pp\. 5099–5108\.
- \[8\]R\. Pope, S\. Douglas, A\. Chowdhery, J\. Devlin, J\. Bradbury, J\. Heek, K\. Xiao, S\. Agrawal, and J\. Dean, “Efficiently scaling transformer inference,” in*Proc\. Mach\. Learn\. Syst\. \(MLSys\)*, 2023\.
- \[9\]S\. Han, H\. Mao, and W\. J\. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in*Proc\. Int\. Conf\. Learn\. Represent\. \(ICLR\)*, 2016\.
- \[10\]G\. Hinton, O\. Vinyals, and J\. Dean, “Distilling the knowledge in a neural network,”*arXiv preprint arXiv:1503\.02531*, 2015\.
- \[11\]X\. Wang, F\. Yu, L\. Dunlap, Y\.\-A\. Ma, R\. Wang, A\. Mirhoseini, T\. Darrell, and J\. E\. Gonzalez, “Deep mixture of experts via shallow embedding,” in*Proc\. Conf\. Uncertainty Artif\. Intell\. \(UAI\)*, 2019\.
- \[12\]X\. Wang, F\. Yu, Z\.\-Y\. Dou, T\. Darrell, and J\. E\. Gonzalez, “SkipNet: Learning Dynamic Routing in Convolutional Networks,” Jul\. 2018\.
- \[13\]Y\. He, P\. Liu, Z\. Wang, Z\. Hu, and Y\. Yang, “Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration,” Jul\. 2019\.
- \[14\]M\. Lin, R\. Ji, Y\. Wang, Y\. Zhang, B\. Zhang, Y\. Tian, and L\. Shao, “HRank: Filter Pruning using High\-Rank Feature Map,” Mar\. 2020\.
- \[15\]S\. Hooker, A\. Courville, G\. Clark, Y\. Dauphin, and A\. Frome, “What do compressed deep neural networks forget?” in*arXiv Preprint arXiv:1911\.05248*, 2019\.
- \[16\]N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. V\. Le, G\. Hinton, and J\. Dean, “Outrageously large neural networks: The sparsely\-gated mixture\-of\-experts layer,” in*Proc\. Int\. Conf\. Learn\. Represent\. \(ICLR\)*, 2017\.
- \[17\]D\. Dai, C\. Deng, C\. Zhao, R\. Xu, H\. Gao, D\. Chen, J\. Li, W\. Zeng, X\. Yu, Y\. Wu, Z\. Xie, Y\. Li, P\. Huang, F\. Luo, C\. Ruan, Z\. Sui, and W\. Liang, “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture\-of\-Experts Language Models,” in*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, L\.\-W\. Ku, A\. Martins, and V\. Srikumar, Eds\. Bangkok, Thailand: Association for Computational Linguistics, Aug\. 2024, pp\. 1280–1297\.
- \[18\]Z\. Pei, H\.\-L\. Zhen, L\. Zou, X\. Yu, W\. Liu, S\. J\. Pan, M\. Yuan, and B\. Yu, “Analytical FFN\-to\-MoE Restructuring via Activation Pattern Analysis,” Apr\. 2026\.
- \[19\]N\. Gottlieb and M\. Werman, “DecisioNet: A Binary\-Tree Structured Neural Network,” in*Computer Vision – ACCV 2022*, L\. Wang, J\. Gall, T\.\-J\. Chin, I\. Sato, and R\. Chellappa, Eds\. Cham: Springer Nature Switzerland, 2023, vol\. 13841, pp\. 556–570\.
- \[20\]I\. S\. Dhillon and D\. S\. Modha, “Concept Decompositions for Large Sparse Text Data Using Clustering,”*Machine Learning*, vol\. 42, no\. 1, pp\. 143–175, Jan\. 2001\.
- \[21\]A\. Banerjee, I\. S\. Dhillon, J\. Ghosh, and S\. Sra, “Clustering on the Unit Hypersphere using von Mises\-Fisher Distributions,”*Journal of Machine Learning Research*, vol\. 6, no\. 46, pp\. 1345–1382, 2005\.
- \[22\]W\. B\. Johnson and J\. Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” in*Contemporary Mathematics*, R\. Beals, A\. Beck, A\. Bellow, and A\. Hajian, Eds\. Providence, Rhode Island: American Mathematical Society, 1984, vol\. 26, pp\. 189–206\.
- \[23\]A\. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images,”*Master’s thesis, University of Toronto*, 2009\.
- \[24\]J\. Deng, W\. Dong, R\. Socher, L\.\-J\. Li, K\. Li, and L\. Fei\-Fei, “ImageNet: A large\-scale hierarchical image database,” in*2009 IEEE Conference on Computer Vision and Pattern Recognition*, Jun\. 2009, pp\. 248–255\.
- \[25\]Z\. Wu, S\. Song, A\. Khosla, F\. Yu, L\. Zhang, X\. Tang, and J\. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in*2015 IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\)*, Jun\. 2015, pp\. 1912–1920\.
- \[26\]I\. Loshchilov and F\. Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts,” May 2017\.
- \[27\]D\. P\. Kingma and J\. Ba, “Adam: A Method for Stochastic Optimization,” Jan\. 2017\.
KOHGA TANAKA received the B.E. degree from Keio University, Japan, in 2024, where he is currently pursuing the master's degree. His research interests include lightweight dynamic-inference frameworks, active-parameter reduction, and edge AI.
HIROAKI NISHI has been a Researcher with the Real World Computing Partnership, since 1999, and with the Central Research Laboratory, Hitachi Ltd., since 2002. He has been a Professor with Keio University, since 2014. He is the Chair of IEEE P21451-1-6 and IEEE P2992; and a member of IEEE 1451 Families, IEEE P2668, and IEEE P2805. He was also a member of the ITU-T Focus Group on Smart Sustainable Cities WG2. He is a member of several committees established by the Ministry of Internal Affairs and Communications. The main theme of his current research is a total network system that includes the development of hardware and software architecture.