Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

arXiv cs.AI Papers

Summary

Introduces Simplex-Constrained Sparse Bagging (SCSB), a post-training framework that optimizes estimator weights over the probability simplex using out-of-bag samples, achieving up to 96% ensemble compression and improved calibration.

arXiv:2606.13589v1 Announce Type: cross Abstract: We present Simplex-Constrained Sparse Bagging (SCSB), a mathematically rigorous framework for post-training compression and probability calibration of bootstrap-based bagging ensembles. Standard bagging ensembles (such as Random Forests, Bagged SVMs, and Bagged Neural Networks) assign uniform voting power to all constituent estimators. However, this naive uniform prior ignores the varying local competence of base estimators and contributes to model overconfidence. We formulate ensemble pruning and calibration as a joint optimization problem over the probability simplex by minimizing the Out-Of-Bag (OOB) loss. To induce sparsity, we address the theoretical "L1-simplex paradox" -- the mathematical reality that the L1 norm is constant on the simplex and fails to prune -- by introducing a concave quadratic penalty. SCSB is model-agnostic and achieves up to 96% ensemble compression, yielding linear inference speedups and superior probability calibration (lowered Expected Calibration Error) while preserving or enhancing generalization accuracy.

Cached at: 06/15/26, 09:13 AM

# Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning
Source: [https://arxiv.org/html/2606.13589](https://arxiv.org/html/2606.13589)
Meher Sai Preetam Madiraju (mehersaipreetam@gatech.edu), Meher Bhaskar Madiraju (meherbhaskar.madiraju@gatech.edu)

###### Abstract

We present Simplex-Constrained Sparse Bagging (SCSB), a mathematically rigorous framework for post-training compression and probability calibration of bootstrap-based bagging ensembles. Standard bagging ensembles (such as Random Forests, Bagged SVMs, and Bagged Neural Networks) assign uniform voting power ($w_i = 1/N$) to all constituent estimators. However, this naive uniform prior ignores the varying local competence of base estimators and contributes to model overconfidence. We formulate ensemble pruning and calibration as a joint optimization problem over the probability simplex ($\Delta^N$) by minimizing the Out-Of-Bag (OOB) loss. To induce sparsity, we address the theoretical “$L_1$-simplex paradox” (the mathematical reality that the $L_1$ norm is constant on the simplex and fails to prune) by introducing a concave quadratic penalty. SCSB is model-agnostic and achieves up to 96% ensemble compression, yielding linear inference speedups and superior probability calibration (lowered Expected Calibration Error) while preserving or enhancing generalization accuracy.

Keywords: Ensemble Pruning, Probability Calibration, Convex Optimization, Simplex Constraints, Model Compression.

## I Introduction

Ensemble methods, particularly bagging (Bootstrap Aggregating) \[[1](https://arxiv.org/html/2606.13589#bib.bib1)\] and its extensions such as Random Forests \[[2](https://arxiv.org/html/2606.13589#bib.bib2)\], are among the most robust and widely deployed paradigms in machine learning. By training multiple base estimators on bootstrap samples of the training data and averaging their predictions, bagging reduces variance without increasing bias. However, this variance reduction comes at a steep computational cost. Large ensembles require substantial memory footprints and impose significant computational latency during inference, rendering them challenging to deploy in resource-constrained or real-time production environments.

Furthermore, traditional bagging relies on a naive uniform prior where every base estimator is assigned equal voting weight ($w_j = 1/N$). This uniform prior assumption is sub-optimal for two primary reasons:

1. Calibration Overconfidence: Uniform averaging of probability estimators tends to push ensemble outputs toward moderate values, but when estimators are correlated, it leads to poorly calibrated, overconfident predictions near class boundaries, inflating the Expected Calibration Error (ECE) \[[10](https://arxiv.org/html/2606.13589#bib.bib10)\].
2. Estimator Redundancy: A significant fraction of base estimators in a bootstrap ensemble are redundant or represent noisy sub-spaces. Treating them equally limits generalization capacity and wastes inference computations \[[14](https://arxiv.org/html/2606.13589#bib.bib14)\].

To resolve these limitations, we propose Simplex-Constrained Sparse Bagging (SCSB), a post-training framework that transitions bagging ensembles from uniform priors to sparse posteriors. By optimizing estimator weights over the probability simplex using out-of-bag (OOB) validation samples, we learn a sparse posterior weight distribution. Estimators with negligible weights are pruned completely, resulting in a significantly smaller ensemble. The remaining active estimators are weighted to minimize loss, improving both calibration and generalization.

Standard sparse coding approaches rely on Lasso ($L_1$ regularization) to zero out weights \[[7](https://arxiv.org/html/2606.13589#bib.bib7)\]. However, when optimization is constrained to the probability simplex, the $L_1$ norm of the weight vector is constant. Consequently, standard $L_1$ regularization fails to induce sparsity. SCSB resolves this “$L_1$-simplex paradox” by using a concave quadratic penalty ($-\lambda \|w\|_2^2$), which forces weights to the boundaries of the simplex, yielding exact zero weights.

Our contributions are summarized as follows:

- We formulate ensemble pruning and calibration as a joint optimization problem over the probability simplex ($\Delta^N$), minimizing Out-Of-Bag (OOB) loss to avoid data leakage.
- We address the theoretical $L_1$-simplex paradox by introducing a concave quadratic penalty and provide the mathematical proof of its vertex convergence.
- We derive analytical gradients for both Classification (Log-Loss) and Regression (MSE) to enable efficient, fast-converging optimization via SLSQP.
- We demonstrate empirically that SCSB achieves up to 96% ensemble compression, yielding linear inference speedups and superior probability calibration while preserving or enhancing generalization accuracy.

## II Related Work

### II-A Ensemble Pruning

Ensemble pruning aims to select a subset of estimators from a pre-trained ensemble \[[4](https://arxiv.org/html/2606.13589#bib.bib4)\]. Traditional methods are dominated by heuristic search techniques (e.g., genetic algorithms, greedy search) and ordering-based pruning (sorting estimators by validation performance and selecting the top $k$) \[[3](https://arxiv.org/html/2606.13589#bib.bib3), [5](https://arxiv.org/html/2606.13589#bib.bib5)\]. While simple, these approaches do not optimize the weights of the selected estimators jointly and lack mathematical guarantees. SCSB performs joint selection and weight optimization within a unified constrained framework.

### II-B Stacking and Data Leakage

Stacked Generalization (Stacking) trains a meta-estimator to combine base predictions \[[6](https://arxiv.org/html/2606.13589#bib.bib6)\]. However, stacking on the training set suffers from severe data leakage, leading to overfit meta-models. Solving this typically requires expensive $K$-fold cross-validation during the base training phase. In contrast, bagging ensembles naturally provide Out-of-Bag (OOB) samples for each estimator, representing a built-in, leakage-free validation set. SCSB leverages these OOB samples to optimize weights without additional training overhead or validation splits.

### II-C The L1-Simplex Paradox

Sparse coding and model compression typically rely on Lasso ($L_1$ regularization) to zero out weights \[[7](https://arxiv.org/html/2606.13589#bib.bib7)\]. However, when optimization is constrained to the probability simplex ($\sum w_j = 1, w_j \geq 0$), the $L_1$ norm of the weight vector is mathematically constant ($\|w\|_1 = \sum |w_j| = \sum w_j = 1$). Consequently, standard $L_1$ regularization fails to induce sparsity in simplex-constrained domains. SCSB resolves this paradox by using a concave quadratic penalty ($-\lambda \|w\|_2^2$), which forces weights to the boundaries of the simplex, yielding exact zero weights.

### II-D Probability Calibration

Probability calibration ensures that a model’s predicted confidence aligns with empirical accuracy \[[10](https://arxiv.org/html/2606.13589#bib.bib10)\]. Traditional calibration methods, such as Platt scaling \[[8](https://arxiv.org/html/2606.13589#bib.bib8)\] and Isotonic Regression \[[9](https://arxiv.org/html/2606.13589#bib.bib9)\], are applied post-hoc to the final predictions of the ensemble. In contrast, SCSB embeds probability calibration directly into the ensemble compression process by minimizing Log-Loss on out-of-bag samples, leading to naturally well-calibrated ensemble predictions.

## III Proposed Method: SCSB

### III-A The Base Ensemble and OOB Indicators

Let $\mathcal{H} = \{f_1, f_2, \dots, f_N\}$ be a bagging ensemble trained on bootstrap samples of a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{M}$. For each sample $i$ and estimator $j$, we define the Out-of-Bag (OOB) indicator variable $I_{i,j}$ as:

$$I_{i,j} = \begin{cases} 1 & \text{if sample } i \text{ is Out-of-Bag for model } j \\ 0 & \text{otherwise} \end{cases} \tag{1}$$
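The indicator in equation (1) can be sketched with plain NumPy. This is a minimal illustration, not the authors' implementation: the helper name `bootstrap_oob_indicator` and the choice to draw exactly `n_samples` indices per estimator are assumptions made here for the example.

```python
import numpy as np

def bootstrap_oob_indicator(n_samples, n_estimators, seed=0):
    """Draw one bootstrap sample per estimator and build the OOB matrix.

    Returns (indices, I) where indices[j] holds the rows drawn for
    estimator j and I[i, j] == 1 when row i was never drawn for it.
    Hypothetical helper, written only to illustrate equation (1).
    """
    rng = np.random.default_rng(seed)
    # Sampling with replacement: each estimator sees n_samples draws.
    indices = rng.integers(0, n_samples, size=(n_estimators, n_samples))
    I = np.ones((n_samples, n_estimators), dtype=int)
    for j in range(n_estimators):
        I[indices[j], j] = 0  # drawn at least once, so in-bag for j
    return indices, I

indices, I = bootstrap_oob_indicator(n_samples=200, n_estimators=25)
oob_fraction = I.mean()  # share of (sample, estimator) pairs that are OOB
print(f"OOB fraction: {oob_fraction:.3f}")
```

With bootstrap samples of the same size as the dataset, roughly $(1 - 1/M)^M \approx 0.37$ of the rows are out-of-bag for any given estimator, which is why each sample typically has several OOB estimators to draw on.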

### III-B Leakage-Free OOB Ensemble Estimation

To evaluate the ensemble without data leakage, we compute the OOB predictions of the weighted ensemble. For a sample $i$, the weighted prediction is computed using only the models for which sample $i$ was out-of-bag:

$$\hat{y}_i^{\text{OOB}}(w) = \frac{\sum_{j=1}^{N} w_j I_{i,j} f_j(x_i)}{\sum_{j=1}^{N} w_j I_{i,j}} \tag{2}$$

This formulation ensures that the optimization target represents true generalization performance and prevents overfitting during weight selection.
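Equation (2) reduces to two matrix-vector products. The sketch below assumes a prediction matrix `F` with `F[i, j]` equal to $f_j(x_i)$ and an indicator matrix `I` as in equation (1); both names are chosen here for illustration. The paper does not say what happens when the denominator is zero, so returning `NaN` for such samples is an assumption of this example.

```python
import numpy as np

def oob_weighted_prediction(w, F, I):
    """Weighted OOB prediction from equation (2).

    w: (N,) weights, F: (M, N) base predictions, I: (M, N) OOB indicators.
    Returns (y_hat, D). Assumption: y_hat[i] is NaN when D_i(w) = 0,
    i.e. when no OOB estimator for sample i carries any weight.
    """
    w = np.asarray(w, dtype=float)
    D = I @ w                # denominator: sum_j w_j I_ij
    num = (I * F) @ w        # numerator: sum_j w_j I_ij f_j(x_i)
    y_hat = np.full(D.shape, np.nan)
    valid = D > 0
    y_hat[valid] = num[valid] / D[valid]
    return y_hat, D

F = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
I = np.array([[1, 0, 1],
              [0, 1, 1]])
w = np.array([0.5, 0.25, 0.25])
y_hat, D = oob_weighted_prediction(w, F, I)
print(y_hat, D)
```

For the first sample only estimators 1 and 3 are out-of-bag, so the prediction is $(0.5 \cdot 1 + 0.25 \cdot 3) / 0.75 = 5/3$; the in-bag estimator 2 is ignored even though it has non-zero weight.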

### III-C Simplex Optimization & The Sparsity Penalty

We find the optimal weight vector $w^*$ by solving the following constrained optimization problem:

$$\min_{w} \mathcal{L}(w) := \frac{1}{M} \sum_{i=1}^{M} \text{Loss}\left(y_i, \hat{y}_i^{\text{OOB}}(w)\right) - \lambda \|w\|_2^2 \tag{3}$$

$$\text{subject to } w_j \geq 0 \quad \forall j \in \{1, \dots, N\}, \quad \sum_{j=1}^{N} w_j = 1 \tag{4}$$

where Loss is Log-Loss for classification and Mean Squared Error (MSE) for regression, and $\lambda \geq 0$ controls the strength of the concave penalty.
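To make the role of the penalty concrete, the following sketch evaluates objective (3) with MSE as the loss, plus a feasibility check for constraint (4). The function names and the rule that samples with a zero denominator contribute zero loss are assumptions of this example, not details given in the paper.

```python
import numpy as np

def scsb_objective(w, F, I, y, lam):
    """Objective (3) with MSE loss on weighted OOB predictions.

    Assumption (not from the paper): samples whose OOB estimators all
    have zero weight are skipped in the data term.
    """
    w = np.asarray(w, dtype=float)
    D = I @ w
    valid = D > 0
    y_hat = ((I * F) @ w)[valid] / D[valid]
    data_term = np.sum((y_hat - y[valid]) ** 2) / len(y)
    return data_term - lam * np.dot(w, w)

def on_simplex(w, atol=1e-9):
    """Check the constraints in (4): non-negative weights summing to one."""
    w = np.asarray(w, dtype=float)
    return bool(np.all(w >= 0) and abs(w.sum() - 1.0) <= atol)

# Four identical, perfect estimators: the data term is zero for every
# feasible w, so only the concave penalty separates candidate weights.
y = np.array([1.0, 2.0, 3.0])
F = np.tile(y[:, None], (1, 4))
I = np.ones((3, 4))
uniform = np.full(4, 0.25)
vertex = np.array([1.0, 0.0, 0.0, 0.0])

obj_uniform = scsb_objective(uniform, F, I, y, lam=0.1)
obj_vertex = scsb_objective(vertex, F, I, y, lam=0.1)
print(obj_uniform, obj_vertex)
```

When the estimators are interchangeable, the vertex scores strictly lower than the uniform vector, so the minimizer keeps a single estimator. This is the redundancy case the penalty is designed to exploit.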

### III-D Optimization Algorithm & Gradients

We solve the optimization problem using Sequential Least Squares Programming (SLSQP) \[[11](https://arxiv.org/html/2606.13589#bib.bib11)\]. To ensure numerical efficiency and fast convergence, we derive the exact analytical gradients.

Let $D_i(w) = \sum_{j=1}^{N} w_j I_{i,j}$ be the sum of active OOB weights for sample $i$.

#### III-D1 Classification Gradient

For classification, let $P_{i,j,c}$ be the predicted probability of class $c$ for sample $i$ by base model $j$, and let $Y_{i,c} \in \{0,1\}$ be the one-hot encoded target. The ensemble OOB prediction is $p_{i,c}(w) = \frac{\sum_j w_j I_{i,j} P_{i,j,c}}{D_i(w)}$. The gradient of the Log-Loss objective with respect to weight $w_k$ is:

$$\frac{\partial \mathcal{L}_{clf}}{\partial w_k} = -\frac{1}{M} \sum_{i=1}^{M} \frac{I_{i,k}}{D_i(w)} \left[ \sum_{c=1}^{C} \frac{Y_{i,c} P_{i,k,c}}{p_{i,c}(w)} - 1 \right] - 2\lambda w_k \tag{5}$$

Proof of Classification Gradient: The multi-class Log-Loss objective on OOB samples is:

$$\mathcal{L}_{clf}(w) = -\frac{1}{M} \sum_{i=1}^{M} \sum_{c=1}^{C} Y_{i,c} \log p_{i,c}(w) - \lambda \sum_{j=1}^{N} w_j^2$$

Differentiating $p_{i,c}(w)$ with respect to $w_k$:

$$\begin{aligned} \frac{\partial p_{i,c}(w)}{\partial w_k} &= \frac{I_{i,k} P_{i,k,c} D_i(w) - I_{i,k} \sum_j w_j I_{i,j} P_{i,j,c}}{D_i(w)^2} \\ &= \frac{I_{i,k}}{D_i(w)} \left( P_{i,k,c} - p_{i,c}(w) \right) \end{aligned}$$

Using the chain rule:

$$\begin{aligned} \frac{\partial \mathcal{L}_{clf}}{\partial w_k} &= -\frac{1}{M} \sum_{i=1}^{M} \sum_{c=1}^{C} \frac{Y_{i,c}}{p_{i,c}(w)} \frac{\partial p_{i,c}(w)}{\partial w_k} - 2\lambda w_k \\ &= -\frac{1}{M} \sum_{i=1}^{M} \sum_{c=1}^{C} \frac{Y_{i,c}}{p_{i,c}(w)} \frac{I_{i,k}}{D_i(w)} \left( P_{i,k,c} - p_{i,c}(w) \right) - 2\lambda w_k \\ &= -\frac{1}{M} \sum_{i=1}^{M} \frac{I_{i,k}}{D_i(w)} \left[ \sum_{c=1}^{C} \frac{Y_{i,c} P_{i,k,c}}{p_{i,c}(w)} - \sum_{c=1}^{C} Y_{i,c} \right] - 2\lambda w_k \end{aligned}$$

Since $\sum_{c=1}^{C} Y_{i,c} = 1$, the result follows. $\blacksquare$
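One way to sanity-check equation (5) is to compare it with central finite differences on synthetic data. The sketch below does this under stated assumptions: the array layout (`P` of shape `(M, N, C)`), the function names, and the random test data are all choices made for this example, and every sample is forced to have at least one OOB estimator so that $D_i(w) > 0$.

```python
import numpy as np

def clf_loss_and_grad(w, P, I, Y, lam):
    """Log-Loss objective and the analytical gradient of equation (5).

    P: (M, N, C) base probabilities, I: (M, N) OOB indicators,
    Y: (M, C) one-hot targets. Assumes D_i(w) > 0 for every sample.
    """
    M = P.shape[0]
    D = I @ w                                            # D_i(w)
    p = np.einsum("j,ij,ijc->ic", w, I, P) / D[:, None]  # p_{i,c}(w)
    loss = -np.sum(Y * np.log(p)) / M - lam * np.dot(w, w)
    ratio = np.einsum("ic,ikc->ik", Y / p, P)            # sum_c Y P / p
    grad = -np.sum((I / D[:, None]) * (ratio - 1.0), axis=0) / M
    return loss, grad - 2.0 * lam * w

def central_difference(f, w, eps=1e-6):
    """Numerical gradient of scalar function f at w."""
    g = np.zeros_like(w)
    for k in range(w.size):
        step = np.zeros_like(w)
        step[k] = eps
        g[k] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
M, N, C = 40, 6, 3
P = rng.dirichlet(2.0 * np.ones(C), size=(M, N))   # synthetic probabilities
I = (rng.random((M, N)) < 0.4).astype(float)
I[np.arange(M), rng.integers(0, N, size=M)] = 1.0  # >= 1 OOB model per row
Y = np.eye(C)[rng.integers(0, C, size=M)]
w = rng.dirichlet(10.0 * np.ones(N))               # strictly positive weights

loss, grad = clf_loss_and_grad(w, P, I, Y, lam=0.05)
numeric = central_difference(
    lambda v: clf_loss_and_grad(v, P, I, Y, 0.05)[0], w)
max_abs_error = np.max(np.abs(grad - numeric))
print(f"max |analytic - numeric| = {max_abs_error:.2e}")
```

The finite-difference probe perturbs one coordinate at a time and therefore leaves the simplex slightly; that is harmless here because the formula for $p_{i,c}(w)$ is defined for any positive weight vector.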

#### III-D2 Regression Gradient

For regression, let $f_j(x_i)$ be the continuous prediction of estimator $j$. The gradient of the MSE objective with respect to weight $w_k$ is:

$$\frac{\partial \mathcal{L}_{reg}}{\partial w_k} = \frac{2}{M} \sum_{i=1}^{M} \frac{I_{i,k}}{D_i(w)} \left( \hat{y}_i^{\text{OOB}}(w) - y_i \right) \left( f_k(x_i) - \hat{y}_i^{\text{OOB}}(w) \right) - 2\lambda w_k \tag{6}$$

Proof of Regression Gradient: The MSE objective on OOB samples is:

$$\mathcal{L}_{reg}(w) = \frac{1}{M} \sum_{i=1}^{M} \left( \hat{y}_i^{\text{OOB}}(w) - y_i \right)^2 - \lambda \sum_{j=1}^{N} w_j^2$$

Differentiating $\hat{y}_i^{\text{OOB}}(w)$ with respect to $w_k$:

$$\begin{aligned} \frac{\partial \hat{y}_i^{\text{OOB}}(w)}{\partial w_k} &= \frac{I_{i,k} f_k(x_i) D_i(w) - I_{i,k} \sum_j w_j I_{i,j} f_j(x_i)}{D_i(w)^2} \\ &= \frac{I_{i,k}}{D_i(w)} \left( f_k(x_i) - \hat{y}_i^{\text{OOB}}(w) \right) \end{aligned}$$

Applying the chain rule, we obtain:

$$\begin{aligned} \frac{\partial \mathcal{L}_{reg}}{\partial w_k} &= \frac{2}{M} \sum_{i=1}^{M} \left( \hat{y}_i^{\text{OOB}}(w) - y_i \right) \frac{\partial \hat{y}_i^{\text{OOB}}(w)}{\partial w_k} - 2\lambda w_k \\ &= \frac{2}{M} \sum_{i=1}^{M} \frac{I_{i,k}}{D_i(w)} \left( \hat{y}_i^{\text{OOB}}(w) - y_i \right) \left( f_k(x_i) - \hat{y}_i^{\text{OOB}}(w) \right) - 2\lambda w_k \end{aligned}$$

Substituting the derivative yields the gradient. $\blacksquare$
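The regression gradient in equation (6) can be checked the same way. This is a sketch on synthetic data; the function names, the matrix layout (`F[i, j]` for $f_j(x_i)$), and the data-generating choices are assumptions of the example, and every sample is given at least one OOB estimator so the denominator stays positive.

```python
import numpy as np

def reg_loss_and_grad(w, F, I, y, lam):
    """MSE objective and the analytical gradient of equation (6).

    F: (M, N) base predictions, I: (M, N) OOB indicators, y: (M,) targets.
    Assumes D_i(w) > 0 for every sample.
    """
    M = y.size
    D = I @ w
    y_hat = ((I * F) @ w) / D
    resid = y_hat - y
    loss = np.mean(resid ** 2) - lam * np.dot(w, w)
    # Per-sample term: (I_ik / D_i) * (y_hat_i - y_i) * (f_k(x_i) - y_hat_i)
    terms = (I / D[:, None]) * resid[:, None] * (F - y_hat[:, None])
    grad = (2.0 / M) * terms.sum(axis=0) - 2.0 * lam * w
    return loss, grad

def central_difference(f, w, eps=1e-6):
    """Numerical gradient of scalar function f at w."""
    g = np.zeros_like(w)
    for k in range(w.size):
        step = np.zeros_like(w)
        step[k] = eps
        g[k] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return g

rng = np.random.default_rng(1)
M, N = 50, 8
y = rng.normal(size=M)
F = y[:, None] + rng.normal(scale=0.5, size=(M, N))  # noisy base predictions
I = (rng.random((M, N)) < 0.4).astype(float)
I[np.arange(M), rng.integers(0, N, size=M)] = 1.0    # >= 1 OOB model per row
w = rng.dirichlet(10.0 * np.ones(N))

loss, grad = reg_loss_and_grad(w, F, I, y, lam=0.1)
numeric = central_difference(
    lambda v: reg_loss_and_grad(v, F, I, y, 0.1)[0], w)
max_abs_error = np.max(np.abs(grad - numeric))
print(f"max |analytic - numeric| = {max_abs_error:.2e}")
```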

## IV Theoretical Analysis

### IV-A Proof of the L1-Simplex Paradox

Let the weight vector $w$ be constrained to the probability simplex $\Delta^N = \{w \in \mathbb{R}^N : w_j \geq 0, \sum w_j = 1\}$. Under these constraints, the $L_1$ norm of $w$ is:

$$\|w\|_1 = \sum_{j=1}^{N} |w_j| = \sum_{j=1}^{N} w_j = 1 \tag{7}$$

Because $\|w\|_1$ is constant across the entire feasible region, its derivative along any feasible direction $d$ (any $d$ with $\sum_j d_j = 0$, so that $w + td$ stays on the simplex for small $t$) is zero. In the ambient space the gradient is the all-ones vector, which is orthogonal to every such direction:

$$\nabla_w \|w\|_1^\top d = \mathbf{1}^\top d = 0 \quad \forall w \in \text{int}(\Delta^N) \tag{8}$$

Thus, adding an $L_1$ penalty (Lasso) to the objective function has no effect on the optimization path and fails to induce sparsity.
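The paradox is easy to observe numerically. The sketch below samples random points on the simplex, confirms that their $L_1$ norm is always one, and shows that adding an $L_1$ penalty to an arbitrary loss leaves the best candidate unchanged. The quadratic loss and the value of `lam` are placeholders invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
W = rng.dirichlet(np.ones(N), size=1000)   # 1000 random points on the simplex
l1 = np.abs(W).sum(axis=1)                 # L1 norm of each candidate

# Hypothetical loss: squared distance to an arbitrary target on the simplex.
target = rng.dirichlet(np.ones(N))
loss = ((W - target) ** 2).sum(axis=1)

lam = 5.0
penalized = loss + lam * l1                # Lasso-style penalized objective

same_argmin = bool(np.argmin(loss) == np.argmin(penalized))
print(l1.min(), l1.max(), same_argmin)
```

Because the penalty adds the same constant `lam` to every feasible candidate, the ranking of candidates is unchanged, which is the sense in which Lasso "fails to prune" here.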

### IV-B Concave Penalty Geometry

SCSB resolves this by employing a concave quadratic penalty $R(w) = -\|w\|_2^2$. Since the negative squared $L_2$ norm is strictly concave, its local and global minima over any compact convex polytope (such as the simplex $\Delta^N$) must lie at the extreme points (vertices) of the constraint set.
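As a worked special case (added here for illustration, not taken from the paper), take $N = 2$ and parametrize the simplex by $w = (t, 1 - t)$ with $t \in [0, 1]$. The penalty then becomes a one-dimensional concave parabola:

```latex
\begin{aligned}
R(w) &= -\|w\|_2^2 = -\left(t^2 + (1-t)^2\right) = -1 + 2t(1-t), \\
R(0, 1) &= R(1, 0) = -1 \quad \text{(vertices: minimum)}, \\
R\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) &= -\tfrac{1}{2} \quad \text{(uniform weights: maximum)}.
\end{aligned}
```

The same pattern holds for general $N$: since $\|w\|_2^2$ ranges from $1/N$ at the uniform vector to $1$ at a vertex, the penalty ranges from $-1/N$ down to $-1$, so it rewards concentrating mass on few estimators.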

Theorem 1 (Vertex Convergence of Concave Minimization): Let $\mathcal{C}$ be a compact convex set, and let $g(w)$ be a strictly concave function on $\mathcal{C}$. Then any local minimizer of $g(w)$ over $\mathcal{C}$ is an extreme point (vertex) of $\mathcal{C}$.

Proof: Suppose $w^* \in \mathcal{C}$ is a local minimizer of $g(w)$ over $\mathcal{C}$ but is not an extreme point. Then there exist $u, v \in \mathcal{C}$ ($u \neq v$) and $\alpha \in (0,1)$ such that $w^* = \alpha u + (1-\alpha) v$. Because $\mathcal{C}$ is convex, $u$ and $v$ can be replaced by $w^* + t(u - w^*)$ and $w^* + t(v - w^*)$ for any $t \in (0, 1]$ without changing $\alpha$, so we may assume both points lie in an arbitrarily small neighborhood of $w^*$. By the definition of strict concavity:

$$g(w^*) = g(\alpha u + (1-\alpha) v) > \alpha g(u) + (1-\alpha) g(v)$$

Without loss of generality, let $g(u) \leq g(v)$. Then:

$$g(w^*) > \alpha g(u) + (1-\alpha) g(u) = g(u)$$

This contradicts the assumption that $w^*$ is a local minimizer, since $u \in \mathcal{C}$ lies arbitrarily close to $w^*$ and yields a strictly lower value. Thus, any local minimizer must be an extreme point of $\mathcal{C}$. $\blacksquare$

By combining a convex predictive loss (which pulls the solution into the interior of the simplex to model combinations) with the concave penalty (which pushes weights to the vertices), we create a controllable Pareto frontier. Adjusting $\lambda$ forces non-essential estimator weights to collapse exactly to 0, achieving true sparsity.
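The interplay between the loss and the penalty can be explored with a small sweep over $\lambda$. The sketch below is one possible realization under explicit assumptions: it uses SciPy's `scipy.optimize.minimize` with `method="SLSQP"` (the paper names SLSQP but not a specific library), synthetic regression data, a hypothetical pruning threshold `prune_tol`, and a rule that skips samples whose OOB estimators all have zero weight. None of these details come from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def reg_loss_and_grad(w, F, I, y, lam):
    """MSE objective (3) and gradient (6) on weighted OOB predictions."""
    D = I @ w
    # Assumption: samples with no weighted OOB estimator are skipped.
    valid = D > 1e-12
    D_safe = np.where(valid, D, 1.0)
    y_hat = ((I * F) @ w) / D_safe
    resid = np.where(valid, y_hat - y, 0.0)
    loss = np.mean(resid ** 2) - lam * np.dot(w, w)
    terms = (I / D_safe[:, None]) * resid[:, None] * (F - y_hat[:, None])
    grad = 2.0 * terms.mean(axis=0) - 2.0 * lam * w
    return loss, grad

def fit_scsb_weights(F, I, y, lam, prune_tol=1e-6):
    """Solve (3)-(4) with SLSQP, starting from the uniform weights."""
    N = F.shape[1]
    w0 = np.full(N, 1.0 / N)
    result = minimize(
        reg_loss_and_grad, w0, args=(F, I, y, lam), jac=True,
        method="SLSQP", bounds=[(0.0, 1.0)] * N,
        constraints=[{"type": "eq",
                      "fun": lambda w: np.sum(w) - 1.0,
                      "jac": lambda w: np.ones_like(w)}],
    )
    w = np.clip(result.x, 0.0, None)
    w[w < prune_tol] = 0.0        # hypothetical pruning threshold
    return w / w.sum()            # renormalize onto the simplex

rng = np.random.default_rng(0)
M, N = 300, 30
y = rng.normal(size=M)
noise_scale = rng.uniform(0.2, 1.5, size=N)      # estimators differ in quality
F = y[:, None] + rng.normal(size=(M, N)) * noise_scale
I = np.ones((M, N))
for j in range(N):
    I[rng.integers(0, M, size=M), j] = 0.0       # bootstrap draws are in-bag

active_counts = {}
for lam in (0.0, 0.05, 0.5):
    w = fit_scsb_weights(F, I, y, lam)
    active_counts[lam] = int(np.count_nonzero(w))
print(active_counts)
```

Because the problem is non-convex for $\lambda > 0$, the solver returns a local solution that depends on the starting point; the counts printed here are specific to this synthetic setup and should not be read as reproducing the paper's results.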

## V Experimental Setup

We evaluated SCSB across several datasets from scikit-learn and OpenML \[[12](https://arxiv.org/html/2606.13589#bib.bib12)\], summarized in Table [I](https://arxiv.org/html/2606.13589#S5.T1).

- Classification: breast_cancer (binary), diabetes_clf (binary), spambase (binary), and segment (multiclass).
- Regression: diabetes_reg, california_housing, and cpu_act.

TABLE I: Dataset characteristics.

### V-A Baselines

1. Standard Bagging (Uniform): Simple average voting ($w_j = 1/N$) \[[1](https://arxiv.org/html/2606.13589#bib.bib1)\].
2. Lasso-Pruned Bagging: L1-regularized combination trained on OOB predictions \[[7](https://arxiv.org/html/2606.13589#bib.bib7)\].
3. XGBoost: A state-of-the-art gradient boosted tree model (100 trees) \[[13](https://arxiv.org/html/2606.13589#bib.bib13)\].

### V-B Base Models

- Decision Trees: 100 estimators (DecisionTreeClassifier and DecisionTreeRegressor) with bootstrapping.
- Linear Models: 50 estimators (LogisticRegression and Ridge regressors) with bootstrapping.

### V-C Metrics

Classification models are evaluated using accuracy, Log-Loss, and Expected Calibration Error (ECE) \[[10](https://arxiv.org/html/2606.13589#bib.bib10)\]. For consistency in the multiclass setting (e.g., segment), we use the top-label-confidence form of ECE:

$$\text{ECE} = \sum_{m=1}^{B} \frac{|B_m|}{M} \left| \text{acc}(B_m) - \text{conf}(B_m) \right| \tag{9}$$

where $B$ is the number of equally spaced confidence bins, $B_m$ is the set of samples whose top-label confidence falls in bin $m$, and $\text{acc}(B_m)$ and $\text{conf}(B_m)$ are the accuracy and mean confidence on that bin. Regression models are evaluated using Mean Squared Error (MSE) and Coefficient of Determination ($R^2$). Inference latency is evaluated in milliseconds per 1,000 predictions (ms / 1k records).
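Equation (9) translates directly into a short NumPy routine. This is a sketch rather than the evaluation code used in the paper: the default of 10 bins and the right-closed bin convention `(lo, hi]` are assumptions made here, since the paper does not state either.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE as in equation (9).

    probs: (M, C) predicted class probabilities, labels: (M,) integer
    class labels. n_bins=10 and right-closed bins are assumptions.
    """
    conf = probs.max(axis=1)                      # top-label confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap            # |B_m| / M times the gap
    return ece

probs = np.array([[0.95, 0.05],
                  [0.85, 0.15],
                  [0.35, 0.65],
                  [0.45, 0.55]])
labels = np.array([0, 1, 1, 0])
ece = expected_calibration_error(probs, labels)
print(f"ECE = {ece:.2f}")
```

In this toy input each prediction lands in its own bin, so the ECE is the average of $|1 - 0.95|$, $|0 - 0.85|$, $|1 - 0.65|$, and $|0 - 0.55|$, which is 0.45.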

## VI Results and Analysis

### VI-A Classification Results

Table [II](https://arxiv.org/html/2606.13589#S6.T2) summarizes the classification performance of SCSB and the baseline models.

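TABLE II: Classification performance comparison. Bold values indicate the top performer among the ensemble models (Standard Bagging, Lasso-Pruned Bagging, and SCSB) for a given dataset and base estimator configuration. Latency speedup is shown relative to Standard Bagging.

| Dataset | Base Estimator | Model | Accuracy | Log-Loss | ECE | Active Est. | Comp. Ratio | Latency / 1k (ms) |
|---|---|---|---|---|---|---|---|---|
| breast_cancer | DecisionTree | Standard Bagging | 0.9415 | 0.2078 | 0.0535 | 100 | 0.0% | 22.42 |
| breast_cancer | DecisionTree | Lasso-Pruned Bagging | 0.9415 | 0.2064 | 0.0518 | 86 | 14.0% | 20.91 |
| breast_cancer | DecisionTree | SCSB (Ours) | 0.9532 | 0.2173 | 0.0538 | 11 | 89.0% | 8.52 (2.6×) |
| breast_cancer | LogisticRegression | Standard Bagging | 0.9708 | 0.0967 | 0.0387 | 50 | 0.0% | 17.51 |
| breast_cancer | LogisticRegression | Lasso-Pruned Bagging | 0.9708 | 0.0964 | 0.0392 | 46 | 8.0% | 15.35 |
| breast_cancer | LogisticRegression | SCSB (Ours) | 0.9766 | 0.0934 | 0.0354 | 10 | 80.0% | 5.75 (3.0×) |
| breast_cancer | N/A | XGBoost | 0.9532 | 0.1436 | 0.0381 | 100 | 0.0% | 0.96 |
| diabetes_clf | DecisionTree | Standard Bagging | 0.7273 | 0.5898 | 0.1477 | 100 | 0.0% | 22.56 |
| diabetes_clf | DecisionTree | Lasso-Pruned Bagging | 0.7273 | 0.5786 | 0.1415 | 90 | 10.0% | 18.91 |
| diabetes_clf | DecisionTree | SCSB (Ours) | 0.7316 | 0.5971 | 0.1481 | 31 | 69.0% | 8.53 (2.6×) |
| diabetes_clf | LogisticRegression | Standard Bagging | 0.7316 | 0.5193 | 0.4406 | 50 | 0.0% | 25.68 |
| diabetes_clf | LogisticRegression | Lasso-Pruned Bagging | 0.7316 | 0.5169 | 0.4362 | 42 | 16.0% | 18.91 |
| diabetes_clf | LogisticRegression | SCSB (Ours) | 0.7446 | 0.5270 | 0.4462 | 10 | 80.0% | 7.41 (3.5×) |
| diabetes_clf | N/A | XGBoost | 0.7143 | 0.8153 | 0.5562 | 100 | 0.0% | 16.60 |
| spambase | DecisionTree | Standard Bagging | 0.9406 | 0.1839 | 0.5290 | 100 | 0.0% | 28.73 |
| spambase | DecisionTree | Lasso-Pruned Bagging | 0.9421 | 0.1859 | 0.5321 | 96 | 4.0% | 20.36 |
| spambase | DecisionTree | SCSB (Ours) | 0.9457 | 0.2748 | 0.5320 | 31 | 69.0% | 8.53 (3.4×) |
| spambase | LogisticRegression | Standard Bagging | 0.9327 | 0.2018 | 0.5205 | 50 | 0.0% | 17.33 |
| spambase | LogisticRegression | Lasso-Pruned Bagging | 0.9334 | 0.2017 | 0.5218 | 48 | 4.0% | 6.47 |
| spambase | LogisticRegression | SCSB (Ours) | 0.9385 | 0.1982 | 0.5290 | 16 | 68.0% | 3.06 (5.7×) |
| spambase | N/A | XGBoost | 0.9580 | 0.1201 | 0.5495 | 100 | 0.0% | 0.99 |
| segment | DecisionTree | Standard Bagging | 0.9740 | 0.1889 | 0.0235 | 100 | 0.0% | 22.56 |
| segment | DecisionTree | Lasso-Pruned Bagging | 0.9740 | 0.1870 | 0.0233 | 100 | 0.0% | 19.92 |
| segment | DecisionTree | SCSB (Ours) | 0.9553 | 0.3276 | 0.0373 | 26 | 74.0% | 7.15 (3.2×) |
| segment | LogisticRegression | Standard Bagging | 0.9509 | 0.1548 | 0.0190 | 50 | 0.0% | 31.23 |
| segment | LogisticRegression | Lasso-Pruned Bagging | 0.9509 | 0.1543 | 0.0216 | 50 | 0.0% | 19.12 |
| segment | LogisticRegression | SCSB (Ours) | 0.9509 | 0.1526 | 0.0211 | 13 | 74.0% | 7.02 (4.4×) |
| segment | N/A | XGBoost | 0.9798 | 0.0781 | 0.0052 | 100 | 0.0% | 2.81 |

### VI-B Regression Results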

Table [III](https://arxiv.org/html/2606.13589#S6.T3) summarizes the regression performance of SCSB and the baseline models.

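TABLE III: Regression performance comparison. Bold values indicate the top performer among the ensemble models (Standard Bagging, Lasso-Pruned Bagging, and SCSB) for a given dataset and base estimator configuration. Latency speedup is shown relative to Standard Bagging.

| Dataset | Base Estimator | Model | MSE | R² | Active Est. | Comp. Ratio | Latency / 1k (ms) |
|---|---|---|---|---|---|---|---|
| diabetes_reg | DecisionTree | Standard Bagging | 2908.81 | 0.4612 | 100 | 0.0% | 43.08 |
| diabetes_reg | DecisionTree | Lasso-Pruned Bagging | 2998.13 | 0.4446 | 53 | 47.0% | 27.57 |
| diabetes_reg | DecisionTree | SCSB (Ours) | 3118.37 | 0.4223 | 34 | 66.0% | 19.53 (2.2×) |
| diabetes_reg | Ridge | Standard Bagging | 3116.53 | 0.4227 | 50 | 0.0% | 19.89 |
| diabetes_reg | Ridge | Lasso-Pruned Bagging | 3123.13 | 0.4215 | 39 | 22.0% | 10.55 |
| diabetes_reg | Ridge | SCSB (Ours) | 3075.70 | 0.4302 | 12 | 76.0% | 5.73 (3.5×) |
| diabetes_reg | N/A | XGBoost | 3513.66 | 0.3491 | 100 | 0.0% | 4.61 |
| california_housing | DecisionTree | Standard Bagging | 0.3424 | 0.7429 | 100 | 0.0% | 14.97 |
| california_housing | DecisionTree | Lasso-Pruned Bagging | 0.3394 | 0.7452 | 99 | 1.0% | 17.36 |
| california_housing | DecisionTree | SCSB (Ours) | 0.3381 | 0.7461 | 67 | 33.0% | 11.95 (1.25×) |
| california_housing | Ridge | Standard Bagging | 0.6451 | 0.5157 | 50 | 0.0% | 2.11 |
| california_housing | Ridge | Lasso-Pruned Bagging | 0.6416 | 0.5182 | 50 | 0.0% | 1.77 |
| california_housing | Ridge | SCSB (Ours) | 0.6504 | 0.5116 | 10 | 80.0% | 0.51 (4.1×) |
| california_housing | N/A | XGBoost | 0.3021 | 0.7732 | 100 | 0.0% | 0.51 |
| cpu_act | DecisionTree | Standard Bagging | 6.0061 | 0.9824 | 100 | 0.0% | 15.50 |
| cpu_act | DecisionTree | Lasso-Pruned Bagging | 6.0149 | 0.9824 | 100 | 0.0% | 15.92 |
| cpu_act | DecisionTree | SCSB (Ours) | 6.1058 | 0.9821 | 66 | 34.0% | 10.70 (1.45×) |
| cpu_act | Ridge | Standard Bagging | 92.3591 | 0.7293 | 50 | 0.0% | 2.34 |
| cpu_act | Ridge | Lasso-Pruned Bagging | 92.4092 | 0.7292 | 50 | 0.0% | 2.14 |
| cpu_act | Ridge | SCSB (Ours) | 91.3468 | 0.7323 | 2 | 96.0% | 2.03 (1.15×) |
| cpu_act | N/A | XGBoost | 5.5396 | 0.9838 | 100 | 0.0% | 0.59 |

### VI-C Analysis of Results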

As shown in Table [II](https://arxiv.org/html/2606.13589#S6.T2) and Table [III](https://arxiv.org/html/2606.13589#S6.T3), SCSB consistently achieves high compression ratios ranging from 33.0% to 96.0%. For example, on the cpu_act dataset using Ridge regressors, it prunes 96.0% of estimators (retaining only 2 out of 50 models) while slightly increasing $R^2$ to 0.7323 (compared to 0.7293 for standard bagging).

Crucially, Lasso-pruned bagging fails to induce sufficient sparsity on the probability simplex. Because the $L_1$ norm of weights is constant on the simplex, Lasso-pruned bagging yields 0.0% compression on several configurations (e.g., segment classification and california_housing Ridge regression), retaining all estimators. In contrast, SCSB successfully prunes 74.0% and 80.0% of the estimators on those same configurations.

### VI-D Inference Acceleration

In agreement with our prediction, SCSB’s prediction latency scales linearly with the fraction of active estimators. By completely bypassing zero-weighted base models, SCSB classifiers significantly reduce prediction times. For instance, on the spambase dataset with Logistic Regression, inference time is reduced from 17.33 ms to 3.06 ms (representing a 5.7× speedup).
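The source of the speedup is that zero-weighted estimators never need to be evaluated. The sketch below illustrates this with stand-in linear predictors; `predict_pruned`, the random coefficients, and the chosen weight vector are hypothetical and serve only to show that skipping zero-weight models leaves the output unchanged.

```python
import numpy as np

def predict_pruned(w, predict_fns, X):
    """Weighted ensemble prediction that evaluates only active estimators.

    predict_fns is a list of callables standing in for fitted base models.
    Returns the prediction and the number of estimators actually called.
    """
    active = np.flatnonzero(w)          # indices with non-zero weight
    out = np.zeros(X.shape[0])
    for j in active:
        out += w[j] * predict_fns[j](X)
    return out, active.size

rng = np.random.default_rng(0)
N, d = 50, 4
coefs = rng.normal(size=(N, d))
# Hypothetical base models: one linear predictor per coefficient row.
predict_fns = [lambda X, c=c: X @ c for c in coefs]
X = rng.normal(size=(8, d))

w = np.zeros(N)
w[[3, 17]] = [0.7, 0.3]                 # sparse weights: 2 of 50 active

full = sum(w[j] * predict_fns[j](X) for j in range(N))  # calls all 50 models
pruned, n_active = predict_pruned(w, predict_fns, X)    # calls only 2
compression = 1.0 - n_active / N
print(n_active, f"{compression:.0%}")
```

Keeping 2 of 50 estimators corresponds to the 96% compression reported for cpu_act with Ridge regressors. If every base model costs about the same to evaluate, prediction time is roughly proportional to the number of active estimators, although fixed per-call overhead can make the measured speedup smaller than that ratio.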

## VII Discussion and Future Work

### VII-A Reflection on Findings

The empirical results confirm that SCSB successfully resolves the trade\-off between ensemble accuracy and computational overhead\. Minimizing Log\-Loss over Out\-of\-Bag predictions helps calibrate probabilities, which is reflected in low ECE values \(e\.g\., 0\.0354 for Breast Cancer Logistic Regression\) while reducing ensemble size by 80%\. The weight distribution shows a distinct “spike\-and\-slab” behavior, where a small subset of estimators contains the majority of the ensemble’s representation power, making uniform averaging computationally wasteful\.

### VII-B Limitations

- Non-Convexity and Optimization Initialization: The combination of a convex loss function and a strictly concave penalty creates a non-convex optimization landscape. While the SLSQP solver may converge to a local minimum, initialization from the uniform prior center ($w_0 = [1/N, \dots, 1/N]$) empirically serves as a robust starting point, giving more consistent convergence than randomly chosen starting points on the simplex.
- Training Overhead: Precomputing OOB predictions and solving the simplex optimization adds a post-training phase. Although this is fast (typically under 1 second), scaling it to very large ensembles ($N > 1000$) or massive datasets remains a challenge.

### VII-C Extensions and Future Work

The theoretical foundations and empirical success of SCSB open several promising avenues for future investigation:

- Simplex SGD: Developing stochastic gradient descent algorithms on the simplex to scale SCSB to huge ensembles and large datasets.
- Input-Dependent Calibration: Conditioning estimator weights on input features (i.e., $w_j(x)$) to perform localized calibration.
- Robustness Across Architectures: Testing SCSB on deep learning ensembles (such as bagged neural networks) and bagged support vector machines (SVMs) to verify its model-agnostic behavior.

## VIII Conclusion

Simplex-Constrained Sparse Bagging (SCSB) offers a mathematically rigorous, plug-and-play solution for ensemble compression and calibration. By utilizing Out-of-Bag predictions, SCSB eliminates data leakage without requiring validation splits or cross-validation. We resolve the limitation of Lasso on the probability simplex by employing a concave quadratic penalty. Our experiments demonstrate that SCSB prunes 33%–96% of estimators across the evaluated configurations (68%–89% on the classification tasks), providing linear speedups in inference time and superior probability calibration while preserving or enhancing generalization accuracy. This makes SCSB highly suitable for deploying robust bagging ensembles in latency-sensitive production environments.

## References

- \[1\] L. Breiman, “Bagging predictors,” *Machine Learning*, vol. 24, no. 2, pp. 123–140, 1996.
- \[2\] L. Breiman, “Random forests,” *Machine Learning*, vol. 45, no. 1, pp. 5–32, 2001.
- \[3\] R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes, “Ensemble selection from libraries of models,” in *Proceedings of the 21st International Conference on Machine Learning (ICML)*, 2004.
- \[4\] G. Tsoumakas, I. Partalas, and I. Vlahavas, “Selective fusion of classifiers,” *Supervised and Unsupervised Ensemble Methods and Their Applications*, pp. 123–144, 2009.
- \[5\] D. D. Margineantu and T. G. Dietterich, “Pruning bagged classifiers,” in *Proceedings of the 14th International Conference on Machine Learning (ICML)*, 1997.
- \[6\] D. H. Wolpert, “Stacked generalization,” *Neural Networks*, vol. 5, no. 2, pp. 241–259, 1992.
- \[7\] R. Tibshirani, “Regression shrinkage and selection via the lasso,” *Journal of the Royal Statistical Society: Series B*, vol. 58, no. 1, pp. 267–288, 1996.
- \[8\] J. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized logistic regression,” *Advances in Large Margin Classifiers*, vol. 10, no. 3, pp. 61–74, 1999.
- \[9\] B. Zadrozny and C. Elkan, “Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers,” in *Proceedings of the 18th International Conference on Machine Learning (ICML)*, 2001.
- \[10\] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in *Proceedings of the 34th International Conference on Machine Learning (ICML)*, 2017.
- \[11\] D. Kraft, “A software package for sequential least squares quadratic programming,” *Deutsche Forschungs- und Versuchsanstalt für Luft- und Raumfahrt (DFVLR) Report*, 1988.
- \[12\] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo, “OpenML: Networked science in machine learning,” *ACM SIGKDD Explorations Newsletter*, vol. 15, no. 2, pp. 49–60, 2014.
- \[13\] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 2016.
- \[14\] Z.-H. Zhou, *Ensemble Methods: Foundations and Algorithms*. CRC Press, 2012.
