Reducing Learner Redundancy in Boosting via Residual Orthogonalization

Summary

This paper proposes SCBoost, a boosting framework that reduces learner redundancy by projecting residuals onto the orthogonal complement of previous predictions and using covariance-regularized weighting, with theoretical guarantees and strong empirical performance.

arXiv:2606.17567v1 Announce Type: new

Cached at: 06/17/26, 05:41 AM

# Reducing Learner Redundancy in Boosting via Residual Orthogonalization
Source: [https://arxiv.org/html/2606.17567](https://arxiv.org/html/2606.17567)
Authors: Ye Su; Jipeng Guo (College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China); Yong Liu\* (Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872, China); Xin Xu (School of Computer Science, Central China Normal University, Hubei 430000, China); Gangchun Zhang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China); Jinxin Chen (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China); Di Wu\* (School of Computing, Engineering and Mathematical Sciences, La Trobe University, Melbourne VIC 3086, Australia); Longlong Zhao\* (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China)

\*Corresponding authors. Emails: liuyonggsai@ruc.edu.cn, d.wu@latrobe.edu.au, ll.zhao@siat.ac.cn

###### Abstract

While sequential residual fitting is the bedrock of standard boosting frameworks, it inherently breeds learner redundancy by repeatedly revisiting correlated error components. To address this bottleneck, we propose a shift from residual fitting to residual orthogonalization and introduce SCBoost. Our framework tackles redundancy through two complementary mechanisms: Spectral Residual Projection (SRP) and Covariance-Regularized Weighting (CRW). During training, SRP projects each residual target onto the orthogonal complement of the historical prediction subspace, forcing successive learners to capture only novel empirical innovations. During aggregation, CRW optimizes ensemble weights on a validation set with an explicit covariance penalty to mitigate remaining correlations. Theoretically, we provide a finite-sample geometric characterization proving that SRP yields an exact additive residual-energy decomposition. Furthermore, under an isotropic-noise assumption, we rigorously establish the conditions under which this projection improves the effective Signal-to-Noise Ratio. Extensive experiments across ten benchmark datasets demonstrate that SCBoost delivers strong out-of-the-box performance, particularly in accuracy and F1 score. This work reinterprets boosting through a geometric lens, suggesting that explicit redundancy control is a principled and necessary step toward more efficient ensemble architectures.

[Figure 1 panels: (a) Standard Residual Fitting, showing learners $h^{(1)}$ to $h^{(4)}$ with redundant gradient components in the full function space on the way to the optimal $F^{*}$; (b) SCBoost (Ours), showing $h^{(2)} \perp h^{(1)}$ and a purified target (orthogonal innovation) relative to $\mathcal{H}_{t-1}$.]

Figure 1: Conceptual Landscape of Boosting Paradigms. (a) Standard Boosting performs greedy descent in the original function space, leading to redundant updates and “zig-zagging” behavior. (b) SCBoost projects each residual onto the orthogonal complement of the historical subspace $\mathcal{H}_{t-1}$ before training, ensuring that each new learner captures geometrically distinct information and accelerates convergence towards the optimal ensemble $F^{*}$.

## 1 Introduction

Modern boosting frameworks like XGBoost and LightGBM \[[9](https://arxiv.org/html/2606.17567#bib.bib1), [19](https://arxiv.org/html/2606.17567#bib.bib2)\] owe their success primarily to engineering optimizations in efficiency and feature handling. However, their statistical core, sequential residual fitting, has remained largely unchanged for decades \[[26](https://arxiv.org/html/2606.17567#bib.bib3), [5](https://arxiv.org/html/2606.17567#bib.bib4), [24](https://arxiv.org/html/2606.17567#bib.bib5)\]. This paradigm inherently breeds redundancy by training new learners on highly correlated residuals, creating a “redundancy bottleneck” that limits the ensemble’s generalization capability. Breaking this ceiling requires addressing this statistical limitation directly, even if it necessitates rethinking the computational trade-offs that have defined the past decade of development.

While prior efforts have sought to promote diversity through methods like negative correlation learning (NCL) \[[23](https://arxiv.org/html/2606.17567#bib.bib6), [22](https://arxiv.org/html/2606.17567#bib.bib7), [30](https://arxiv.org/html/2606.17567#bib.bib8)\] or randomization \[[11](https://arxiv.org/html/2606.17567#bib.bib9), [20](https://arxiv.org/html/2606.17567#bib.bib10), [31](https://arxiv.org/html/2606.17567#bib.bib11)\], they often treat diversity as a secondary objective, imposing soft penalties that create an unstable trade-off with fitting accuracy. More importantly, they fail to address the root cause of the problem in boosting, the correlated nature of the residual targets themselves \[[2](https://arxiv.org/html/2606.17567#bib.bib12)\].

To break the redundancy bottleneck, we propose a shift from residual fitting to residual orthogonalization. As illustrated in Figure [1](https://arxiv.org/html/2606.17567#S0.F1), this mechanism compels each new learner to approximate only the error component geometrically distinct from the existing ensemble. By projecting the learning target onto the functional null space of historical predictors, we eliminate signal redundancy and promote the extraction of novel information. The primary contribution of this study is the introduction of residual orthogonalization as a fundamental paradigm shift for boosting. We move beyond the traditional fit-and-add logic and establish a new implementation, SCBoost, whose contributions are structured as follows:

- Core Principle. We introduce Spectral Residual Projection (SRP) to modify the residual target before fitting each new learner. By applying spectral decomposition to the prediction history, SRP projects the residual onto the orthogonal complement of the selected historical prediction subspace.
- Geometric Characterization. We show that SRP is an exact empirical orthogonal projection on the training sample. This yields an additive decomposition of the residual energy into a historical component and an orthogonal innovation component.
- Noise Interpretation. Under an explicit fixed-subspace isotropic-noise assumption, we characterize how projection changes the noise energy and the effective SNR. The analysis shows that SNR improves only when the removed signal fraction is smaller than the removed noise fraction.
- Aggregation Strategy. We introduce Covariance-Regularized Weighting (CRW) to aggregate the learned predictors. CRW uses validation-set covariance regularization to reduce the influence of highly correlated learners, and is motivated by the ambiguity decomposition under squared loss.

## 2 SCBoost

Standard boosting fits residuals sequentially but does not explicitly control the correlation between new learners and the existing ensemble, which can lead to redundant updates (Figure [1](https://arxiv.org/html/2606.17567#S0.F1)(a)). We propose SCBoost, a boosting framework based on residual orthogonalization. Instead of fitting the raw residual directly, Spectral Residual Projection (SRP) projects the residual target onto the orthogonal complement of the historical prediction subspace (Figure [1](https://arxiv.org/html/2606.17567#S0.F1)(b)). To aggregate the learned predictors, Covariance-Regularized Weighting (CRW) then assigns ensemble weights with an additional covariance penalty. SRP controls the target used for learner induction, while CRW controls the final aggregation.

### 2.1 Spectral Residual Projection

Algorithm Description. Let the residual vector at iteration $t$ be $\mathbf{r}^{(t)}=\mathbf{y}-\sigma(\mathbf{F}^{(t-1)})$, where $\mathbf{F}^{(t-1)}$ is the current logit output on the training data and $\sigma(\cdot)$ is the sigmoid function. We maintain a prediction history matrix $\mathbf{H}^{(t-1)}=[\mathbf{h}^{(1)},\mathbf{h}^{(2)},\dots,\mathbf{h}^{(t-1)}]\in\mathbb{R}^{n\times(t-1)}$, where $\mathbf{h}^{(j)}$ denotes the prediction vector of the $j$-th base learner on the training data.

At step $t$, we compute the singular value decomposition $\mathbf{H}^{(t-1)}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top}$. Let $\mathbf{U}_{k}=[\mathbf{u}_{1},\dots,\mathbf{u}_{k}]\in\mathbb{R}^{n\times k}$ be the top-$k$ left singular vectors selected by the energy threshold $\alpha\in(0,1)$:

$$\frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sum_{j=1}^{\min(n,t-1)}\sigma_{j}^{2}}\geq\alpha.$$

Here $\sigma_{i}$ denotes the $i$-th singular value of $\mathbf{H}^{(t-1)}$, not the sigmoid function. We define the projection operators $\mathbf{P}_{k}=\mathbf{U}_{k}\mathbf{U}_{k}^{\top}$ and $\mathbf{Q}_{k}=\mathbf{I}_{n}-\mathbf{P}_{k}$. The projected residual target is $\tilde{\mathbf{r}}^{(t)}=\mathbf{Q}_{k}\mathbf{r}^{(t)}=\mathbf{r}^{(t)}-\mathbf{U}_{k}(\mathbf{U}_{k}^{\top}\mathbf{r}^{(t)})$. The next learner is trained to fit $\tilde{\mathbf{r}}^{(t)}$ rather than the raw residual $\mathbf{r}^{(t)}$. This projection guarantees orthogonality of the training target to the selected historical prediction subspace. It does not by itself guarantee that the fitted learner $\mathbf{h}^{(t)}$ is exactly orthogonal to the previous learners, since the fitted learner also depends on the approximation capacity and optimization procedure of the base learner.
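The projection step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `srp_project` is hypothetical, and the rule that $k$ is the smallest index whose cumulative energy reaches $\alpha$ is an assumption (the text only states that the top-$k$ vectors are selected by the threshold).

```python
import numpy as np

def srp_project(H, r, alpha=0.9):
    """Project residual r onto the orthogonal complement of the top-k
    left singular subspace of the prediction history H (n x (t-1))."""
    # Thin SVD: columns of U are left singular vectors, s is sorted descending.
    U, s, _ = np.linalg.svd(H, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    # Assumption: k is the smallest index whose cumulative energy reaches alpha.
    k = min(int(np.searchsorted(energy, alpha)) + 1, len(s))
    U_k = U[:, :k]
    # r_tilde = Q_k r = r - U_k (U_k^T r); the n x n projector is never formed.
    r_tilde = r - U_k @ (U_k.T @ r)
    return r_tilde, U_k

# Usage on synthetic data (shapes are illustrative only).
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 4))   # four previous learners, 50 training points
r = rng.normal(size=50)        # current residual
r_tilde, U_k = srp_project(H, r, alpha=0.9)
print(U_k.shape[1], np.abs(U_k.T @ r_tilde).max())
```

Computing $\mathbf{U}_{k}(\mathbf{U}_{k}^{\top}\mathbf{r})$ costs $O(nk)$ per step, whereas forming $\mathbf{Q}_{k}$ explicitly would cost $O(n^{2})$ memory.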

Theoretical Analysis. We first record the deterministic geometric property of SRP. This result is finite-sample and does not require a distributional assumption.

###### Proposition 2.1 (Empirical Orthogonal Projection; the detailed proof is in Appendix [A](https://arxiv.org/html/2606.17567#A1)).

Let $\mathcal{H}_{t-1}=\operatorname{span}(\mathbf{U}_{k})$, where $\mathbf{U}_{k}$ has orthonormal columns. Let $\mathbf{P}_{k}=\mathbf{U}_{k}\mathbf{U}_{k}^{\top}$ and $\mathbf{Q}_{k}=\mathbf{I}_{n}-\mathbf{P}_{k}$. For any residual vector $\mathbf{r}^{(t)}\in\mathbb{R}^{n}$, the SRP target $\tilde{\mathbf{r}}^{(t)}=\mathbf{Q}_{k}\mathbf{r}^{(t)}$ satisfies

$$\tilde{\mathbf{r}}^{(t)}=\mathop{\arg\min}_{\mathbf{z}\in\mathbb{R}^{n}}\quad\|\mathbf{z}-\mathbf{r}^{(t)}\|_{2}^{2}\quad\text{s.t.}\quad\langle\mathbf{z},\mathbf{u}\rangle=0,\quad\forall\mathbf{u}\in\mathcal{H}_{t-1}.$$

Moreover, $\|\mathbf{r}^{(t)}\|_{2}^{2}=\|\mathbf{P}_{k}\mathbf{r}^{(t)}\|_{2}^{2}+\|\mathbf{Q}_{k}\mathbf{r}^{(t)}\|_{2}^{2}$. Equivalently,

$$\|\tilde{\mathbf{r}}^{(t)}\|_{2}^{2}=\|\mathbf{r}^{(t)}\|_{2}^{2}-\sum_{i=1}^{k}(\mathbf{u}_{i}^{\top}\mathbf{r}^{(t)})^{2}.$$

Proposition [2.1](https://arxiv.org/html/2606.17567#S2.Thmtheorem1) shows that SRP removes exactly the component of the current residual lying in the selected historical prediction subspace. The result is an empirical statement on the training vectors. It should not be interpreted as a guarantee of functional orthogonality outside the training sample.
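The energy decomposition and the minimizing property in Proposition 2.1 can be checked numerically. The sketch below uses a random orthonormal basis as a stand-in for $\mathbf{U}_{k}$; the dimensions and the random data are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5
# Stand-in for U_k: orthonormal basis from the reduced QR of a random matrix.
U_k, _ = np.linalg.qr(rng.normal(size=(n, k)))
r = rng.normal(size=n)

coeffs = U_k.T @ r   # u_i^T r for i = 1..k
p = U_k @ coeffs     # P_k r, the historical component
q = r - p            # Q_k r, the SRP target

total, hist, innov = r @ r, p @ p, q @ q
print(np.isclose(total, hist + innov))                # additive decomposition
print(np.isclose(innov, total - np.sum(coeffs**2)))   # equivalent form

# Any other vector z orthogonal to span(U_k) is at least as far from r as q.
v = rng.normal(size=n)
z = v - U_k @ (U_k.T @ v)
print(np.sum((z - r) ** 2) >= np.sum((q - r) ** 2))
```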

We next give a limited statistical interpretation of SRP under an explicit fixed-subspace noise model. The assumption that the projection is fixed relative to the noise is essential. In the standard boosting implementation, the historical learners are trained from the same labels, so the projection matrix can be label-noise dependent. The following result should therefore be read as a fixed-subspace characterization of the projection operation, not as an unconditional robustness theorem for the full adaptive algorithm.

###### Proposition 2.2 (Fixed-Subspace Noise and SNR Characterization; the detailed proof is in Appendix [B](https://arxiv.org/html/2606.17567#A2)).

Let $\mathbf{r}=\mathbf{s}+\boldsymbol{\epsilon}$, where $\mathbf{s}\in\mathbb{R}^{n}$ is deterministic and $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\nu^{2}\mathbf{I}_{n})$. Assume that $\mathbf{P}_{k}$ is a fixed rank-$k$ orthogonal projector independent of $\boldsymbol{\epsilon}$, and let $\mathbf{Q}_{k}=\mathbf{I}_{n}-\mathbf{P}_{k}$ with $d=\operatorname{rank}(\mathbf{Q}_{k})=n-k$. Then

$$\mathbb{E}\|\mathbf{Q}_{k}\boldsymbol{\epsilon}\|_{2}^{2}=d\nu^{2}=\left(1-\frac{k}{n}\right)\mathbb{E}\|\boldsymbol{\epsilon}\|_{2}^{2}.$$

Furthermore, for any $\delta\in(0,1)$, with probability at least $1-\delta$,

$$\|\mathbf{Q}_{k}\boldsymbol{\epsilon}\|_{2}^{2}\leq\nu^{2}\left[d+2\sqrt{d\log(1/\delta)}+2\log(1/\delta)\right].$$

For the full projected residual, for any $\eta>0$,

$$\|\mathbf{Q}_{k}\mathbf{r}\|_{2}^{2}\leq(1+\eta)\|\mathbf{Q}_{k}\mathbf{s}\|_{2}^{2}+(1+\eta^{-1})\|\mathbf{Q}_{k}\boldsymbol{\epsilon}\|_{2}^{2}.$$

If $0\leq k<n$ and $\mathbf{s}\neq\mathbf{0}$, define $\operatorname{SNR}(\mathbf{r})=\frac{\|\mathbf{s}\|_{2}^{2}}{n\nu^{2}}$ and $\operatorname{SNR}(\mathbf{Q}_{k}\mathbf{r})=\frac{\|\mathbf{Q}_{k}\mathbf{s}\|_{2}^{2}}{(n-k)\nu^{2}}$. Let $\rho=\frac{\|\mathbf{P}_{k}\mathbf{s}\|_{2}^{2}}{\|\mathbf{s}\|_{2}^{2}}$ be the fraction of signal energy removed by the projection. Then

$$\frac{\operatorname{SNR}(\mathbf{Q}_{k}\mathbf{r})}{\operatorname{SNR}(\mathbf{r})}=\frac{1-\rho}{1-k/n}.$$

Thus, the SNR improves if and only if $\rho<\frac{k}{n}$.

Proposition [2.2](https://arxiv.org/html/2606.17567#S2.Thmtheorem2) states the condition under which projection improves the effective SNR. SRP always reduces the expected energy of isotropic noise under the fixed-subspace assumption, but it improves SNR only when the removed signal fraction is smaller than the removed noise fraction. This condition is important because an overly aggressive projection may also remove useful residual signal.
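A small worked example makes the condition $\rho<k/n$ concrete. The sketch below evaluates the SNR ratio for two hand-made signals under a fixed coordinate subspace; the helper `snr_ratio`, the choice $n=100$, $k=10$, and both signals are assumptions for illustration, not settings from the paper.

```python
import numpy as np

def snr_ratio(s, U_k):
    """Return SNR(Q_k r) / SNR(r) and rho under the fixed-subspace model."""
    n, k = U_k.shape
    rho = np.sum((U_k.T @ s) ** 2) / np.sum(s ** 2)   # removed signal fraction
    return (1.0 - rho) / (1.0 - k / n), rho

n, k = 100, 10
U_k = np.eye(n)[:, :k]   # fixed subspace: the first k coordinate axes

# Case A: the signal lies entirely outside the removed subspace (rho = 0).
s_a = np.zeros(n)
s_a[k:] = 1.0
# Case B: the signal is concentrated in the removed subspace (rho > k/n).
s_b = np.ones(n)
s_b[:k] = 5.0

ratio_a, rho_a = snr_ratio(s_a, U_k)
ratio_b, rho_b = snr_ratio(s_b, U_k)
print(rho_a, ratio_a)
print(rho_b, ratio_b)
```

In case A the projection removes a fraction $k/n=0.1$ of the noise and none of the signal, so the ratio exceeds 1. In case B, $\rho=250/340$ is far above $0.1$, so the projection lowers the SNR.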

The optimization effect of SRP can also be described by a simple one-step identity. Let $\mathbf{r}=\mathbf{P}_{k}\mathbf{r}+\mathbf{Q}_{k}\mathbf{r}=\mathbf{p}+\mathbf{q}$. Assume that the learner prediction vector $\mathbf{h}$ is normalized, $\|\mathbf{h}\|_{2}=1$, and lies in the projected subspace, i.e., $\mathbf{h}\in\operatorname{Range}(\mathbf{Q}_{k})$. With the optimal scalar step size $\eta^{\star}=\langle\mathbf{q},\mathbf{h}\rangle$, we have

$$\begin{aligned}\|\mathbf{r}-\eta^{\star}\mathbf{h}\|_{2}^{2}&=\|\mathbf{p}\|_{2}^{2}+\|\mathbf{q}\|_{2}^{2}-\langle\mathbf{q},\mathbf{h}\rangle^{2}\\&=\|\mathbf{P}_{k}\mathbf{r}\|_{2}^{2}+(1-\gamma^{2})\|\mathbf{Q}_{k}\mathbf{r}\|_{2}^{2},\end{aligned}$$

where $\gamma=\frac{\langle\mathbf{Q}_{k}\mathbf{r},\mathbf{h}\rangle}{\|\mathbf{Q}_{k}\mathbf{r}\|_{2}}$.

This identity shows that SRP focuses the new update on the orthogonal innovation component $\mathbf{Q}_{k}\mathbf{r}$. It does not imply that the full residual energy always decreases faster than standard gradient boosting.
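The one-step identity can be verified numerically as follows. The learner vector `h` here is a hypothetical stand-in: a random vector pushed into $\operatorname{Range}(\mathbf{Q}_{k})$ and normalized, which satisfies the two assumptions of the identity but is not the output of any fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 60, 4
U_k, _ = np.linalg.qr(rng.normal(size=(n, k)))   # orthonormal stand-in basis
r = rng.normal(size=n)
p = U_k @ (U_k.T @ r)   # P_k r
q = r - p               # Q_k r

# Hypothetical learner output: projected into Range(Q_k), then normalized.
v = rng.normal(size=n)
h = v - U_k @ (U_k.T @ v)
h /= np.linalg.norm(h)

eta_star = q @ h                       # optimal scalar step size
gamma = (q @ h) / np.linalg.norm(q)    # alignment of h with Q_k r
lhs = np.sum((r - eta_star * h) ** 2)
rhs = p @ p + (1.0 - gamma**2) * (q @ q)
print(np.isclose(lhs, rhs))

# eta_star minimizes the residual energy along h: nearby steps do no better.
for eta in (eta_star - 0.1, eta_star + 0.1):
    print(np.sum((r - eta * h) ** 2) >= lhs)
```

The historical term $\|\mathbf{P}_{k}\mathbf{r}\|_{2}^{2}$ is untouched by the update, which is why the identity alone says nothing about the full residual decreasing faster.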

Algorithm 1: SCBoost Algorithm

1: Input: Dataset $S=\{(x_{i},y_{i})\}_{i=1}^{n}$, iterations $T$, learning rate $\eta$, projection threshold $\alpha$, covariance coefficient $\lambda_{\mathrm{cov}}$

2: Data Split: Partition $S$ into training set $\mathcal{D}_{train}=(\mathbf{X}_{tr},\mathbf{y}_{tr})$ and internal validation set $\mathcal{D}_{val}=(\mathbf{X}_{val},\mathbf{y}_{val})$.

3: Initialize logits $\mathbf{F}^{(0)}=\mathbf{0}$ on $\mathcal{D}_{train}$.

4: Initialize prediction history $\mathbf{H}^{(0)}\leftarrow[\,]$.

5: for $t=1$ to $T$ do

6: Compute residual on $\mathcal{D}_{train}$: $\mathbf{r}^{(t)}=\mathbf{y}_{tr}-\sigma(\mathbf{F}^{(t-1)})$.

7: // Spectral Residual Projection (SRP)

8: if $t>1$ then

9: Perform SVD: $\mathbf{H}^{(t-1)}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top}$.

10: Select top-$k$ components $\mathbf{U}_{k}$ by the energy threshold criterion of Section 2.1.

11: Project residual: $\tilde{\mathbf{r}}^{(t)}=\mathbf{r}^{(t)}-\mathbf{U}_{k}(\mathbf{U}_{k}^{\top}\mathbf{r}^{(t)})$.

12: else

13: $\tilde{\mathbf{r}}^{(t)}=\mathbf{r}^{(t)}$.

14: end if

15: Train weak learner $h^{(t)}$ on $(\mathbf{X}_{tr},\tilde{\mathbf{r}}^{(t)})$.

16: Update logits: $\mathbf{F}^{(t)}=\mathbf{F}^{(t-1)}+\eta h^{(t)}(\mathbf{X}_{tr})$.

17: Update prediction history: $\mathbf{H}^{(t)}\leftarrow[\mathbf{H}^{(t-1)},h^{(t)}(\mathbf{X}_{tr})]$.

18: end for

19: // Covariance-Regularized Weighting (CRW)

20: Construct validation prediction matrix $\mathbf{H}_{val}$ with $\mathbf{H}_{val}[i,j]=h^{(j)}(x_{i})$ for $x_{i}\in\mathbf{X}_{val}$.

21: Compute covariance matrix $\mathbf{C}=\frac{1}{m}\bar{\mathbf{H}}_{val}^{\top}\bar{\mathbf{H}}_{val}$.

22: Solve $\min_{\mathbf{w}\in\Delta_{T}}\mathcal{L}\left(\mathbf{y}_{val},\sigma(\mathbf{H}_{val}\mathbf{w})\right)+\lambda_{\mathrm{cov}}\mathbf{w}^{\top}\mathbf{C}\mathbf{w}$.

23: Output: Final predictor $F(x)=\sum_{t=1}^{T}w_{t}h^{(t)}(x)$.
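The training loop of Algorithm 1 (steps 3 to 18) might be sketched as below. This is an illustrative sketch only: the weak learner `fit_single_feature` (a least-squares line on the single best feature) is a simple stand-in chosen to keep the example dependency-free, not the base learner used in the paper, and the rule for picking $k$ is assumed to be the smallest index reaching the threshold. The validation split and CRW stage are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_single_feature(X, target):
    """Stand-in weak learner: least-squares line on the single best feature."""
    Xc = X - X.mean(axis=0)
    tc = target - target.mean()
    denom = np.sum(Xc**2, axis=0) + 1e-12
    slopes = Xc.T @ tc / denom
    gains = slopes**2 * denom              # squared-error reduction per feature
    j = int(np.argmax(gains))
    a = slopes[j]
    b = target.mean() - a * X[:, j].mean()
    return lambda Z: a * Z[:, j] + b

def scboost_fit(X, y, T=20, lr=0.1, alpha=0.9):
    F = np.zeros(len(y))                   # step 3: initial logits
    H = np.empty((len(y), 0))              # step 4: empty prediction history
    learners = []
    for t in range(1, T + 1):
        r = y - sigmoid(F)                 # step 6: residual
        if t > 1:                          # steps 8-11: SRP
            U, s, _ = np.linalg.svd(H, full_matrices=False)
            energy = np.cumsum(s**2) / max(np.sum(s**2), 1e-12)
            k = min(int(np.searchsorted(energy, alpha)) + 1, len(s))
            U_k = U[:, :k]
            r = r - U_k @ (U_k.T @ r)
        h = fit_single_feature(X, r)       # step 15: fit the projected target
        pred = h(X)
        F = F + lr * pred                  # step 16: update logits
        H = np.column_stack([H, pred])     # step 17: extend history
        learners.append(h)
    return learners, H

# Usage on a synthetic binary problem (data are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
learners, H = scboost_fit(X, y, T=10)
print(H.shape, len(learners))
```

Recomputing the SVD of the full history at every iteration is the simplest choice for a sketch; the text does not say whether the authors update the decomposition incrementally.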

### 2.2 Covariance-Regularized Weighting

Algorithm Description. SRP orthogonalizes the training target before learner induction, but the fitted learners may still be correlated because of limited learner capacity and finite-sample effects. CRW addresses this issue at the aggregation stage. Let $\mathbf{H}_{val}\in\mathbb{R}^{m\times T}$ be the validation prediction matrix, where $\mathbf{H}_{val}[i,j]=h^{(j)}(x_{i})$ for $x_{i}\in\mathbf{X}_{val}$. Let $\bar{\mathbf{H}}_{val}$ be the column-centered version of $\mathbf{H}_{val}$, and define $\mathbf{C}=\frac{1}{m}\bar{\mathbf{H}}_{val}^{\top}\bar{\mathbf{H}}_{val}$. The final ensemble uses weights $\mathbf{w}$ on the probability simplex $\Delta_{T}=\left\{\mathbf{w}\in\mathbb{R}_{+}^{T}:\sum_{t=1}^{T}w_{t}=1\right\}$. We solve

$$\min_{\mathbf{w}\in\Delta_{T}}\quad\mathcal{L}\left(\mathbf{y}_{val},\sigma(\mathbf{H}_{val}\mathbf{w})\right)+\lambda_{\mathrm{cov}}\mathbf{w}^{\top}\mathbf{C}\mathbf{w},\tag{1}$$

where $\mathcal{L}(\cdot,\cdot)$ is the logistic loss and $\lambda_{\mathrm{cov}}\geq 0$ controls the strength of covariance regularization. The term $\mathbf{w}^{\top}\mathbf{C}\mathbf{w}$ is the empirical variance of the weighted aggregate prediction $\mathbf{H}_{val}\mathbf{w}$ on the validation set. Penalizing it encourages the final ensemble to avoid placing large weights on mutually redundant predictors when doing so does not improve validation loss.
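One way to solve Eq. (1) is projected gradient descent on the simplex, sketched below. The solver choice, the step size, the iteration count, and the function names are all assumptions: the text does not state which optimizer the authors use. The simplex projection is the standard sort-based Euclidean projection.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def crw_objective(w, H_val, y_val, C, lam):
    """Mean logistic loss on the validation set plus the covariance penalty."""
    z = H_val @ w
    loss = np.mean(np.logaddexp(0.0, z) - y_val * z)
    return loss + lam * (w @ C @ w)

def crw_weights(H_val, y_val, lam=0.1, step=0.1, iters=500):
    m, T = H_val.shape
    Hc = H_val - H_val.mean(axis=0)   # column-centered predictions
    C = Hc.T @ Hc / m                 # validation covariance matrix
    w = np.full(T, 1.0 / T)           # start from uniform weights
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(H_val @ w)))
        grad = H_val.T @ (p - y_val) / m + 2.0 * lam * (C @ w)
        w = project_simplex(w - step * grad)
    return w, C

# Usage with synthetic validation predictions (illustrative only).
rng = np.random.default_rng(0)
H_val = rng.normal(size=(80, 5))
y_val = (H_val[:, 0] > 0).astype(float)
w, C = crw_weights(H_val, y_val, lam=0.1)
print(w.round(3), w.sum())
```

Because both the logistic loss and the quadratic penalty are convex in $\mathbf{w}$ and the simplex is a convex set, a first-order method of this kind is sufficient for a sketch; a general-purpose constrained solver would be an equally reasonable choice.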

Theoretical Motivation. We do not use a standalone Rademacher-complexity theorem to justify Eq. ([1](https://arxiv.org/html/2606.17567#S2.E1)). While recent literature has deeply analyzed the high-dimensional risk and implicit bias of boosting under squared loss ($\ell_{2}$-boosting) \[[28](https://arxiv.org/html/2606.17567#bib.bib52)\], we use the standard ambiguity decomposition as a motivation for covariance-aware aggregation. For squared loss and convex weights, define $F(x)=\sum_{t=1}^{T}w_{t}h^{(t)}(x)$. Then the following identity holds for each sample $(x,y)$:

$$(F(x)-y)^{2}=\sum_{t=1}^{T}w_{t}(h^{(t)}(x)-y)^{2}-\sum_{t=1}^{T}w_{t}(h^{(t)}(x)-F(x))^{2}.$$

The second term is the ambiguity term. For a fixed weighted average individual error, increasing ambiguity reduces the squared error of the ensemble. Since strongly correlated learners tend to have smaller ambiguity, the covariance penalty in Eq. ([1](https://arxiv.org/html/2606.17567#S2.E1)) is consistent with this decomposition. This argument is a motivation for CRW under squared loss; it is not a generalization guarantee for the logistic-loss objective.
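The ambiguity decomposition holds sample by sample for any convex weights, which is easy to confirm numerically. In the sketch below the learner outputs, targets, and weights are random placeholders, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, T = 50, 6
Hm = rng.normal(size=(m, T))   # Hm[i, t] plays the role of h^(t)(x_i)
y = rng.normal(size=m)
w = rng.random(T)
w /= w.sum()                   # convex weights: nonnegative, summing to 1

F = Hm @ w                                    # ensemble prediction per sample
ens_err = (F - y) ** 2                        # left-hand side
avg_err = ((Hm - y[:, None]) ** 2) @ w        # weighted individual error
ambiguity = ((Hm - F[:, None]) ** 2) @ w      # weighted spread around F
print(np.allclose(ens_err, avg_err - ambiguity))

# With identical learners the ambiguity vanishes and no error is averaged away.
H_same = np.repeat(Hm[:, :1], T, axis=1)
amb_same = ((H_same - (H_same @ w)[:, None]) ** 2) @ w
print(np.allclose(amb_same, 0.0))
```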

### 2.3 Algorithm Summary

SCBoost differs from diversity-promoting methods such as NCL in where the diversity constraint is introduced. SCBoost modifies the target used to train each new learner, while CRW adjusts the final aggregation weights. The complete procedure is summarized in Algorithm [1](https://arxiv.org/html/2606.17567#alg1).

- Target vs. Output: SRP orthogonalizes the learning target on the training sample. This is different from directly penalizing correlations among model outputs during learner training.
- Pre-emptive vs. Corrective: SRP removes the selected historical component from the residual before fitting the next learner. CRW then handles remaining correlations after the learners have been trained.
- Geometric Interpretation: SRP decomposes the residual into a historical component and an orthogonal innovation component. The next learner is trained on the innovation component.

## 3 Experiments and Results

In this section, we report experimental results for the proposed SCBoost algorithm on real-world binary classification tasks from various domains. We evaluate its prediction performance, ensemble diversity, robustness under different label-noise ratios, and hyperparameter sensitivity, and we run ablation studies to quantify the contribution of each key component of our method.

### 3.1 Setup

Dataset. We used ten datasets from the OpenML repository, all of which are publicly available. These datasets include Madelon (2,600 samples, 500 features) \[[17](https://arxiv.org/html/2606.17567#bib.bib34)\], Jasmine (2,984 samples, 144 features), Bioresponse (3,751 samples, 1,776 features), Creditcard (284,807 samples, 30 features) \[[10](https://arxiv.org/html/2606.17567#bib.bib35)\], OVA_Uterus (1,545 samples, 10,936 features) \[[27](https://arxiv.org/html/2606.17567#bib.bib33)\], QSAR (1,055 samples, 41 features), Scene (2,407 samples, 300 features), Letter (20,000 samples, 42 features), Pol (10,082 samples, 27 features), and Elevators (16,599 samples, 17 features). We employed random undersampling \[[18](https://arxiv.org/html/2606.17567#bib.bib37), [21](https://arxiv.org/html/2606.17567#bib.bib36)\] to address class imbalance in the Creditcard and OVA_Uterus datasets.

Baselines. We selected a comprehensive set of seven baseline algorithms (RF \[[4](https://arxiv.org/html/2606.17567#bib.bib50)\], ADA \[[13](https://arxiv.org/html/2606.17567#bib.bib13)\], GBDT \[[14](https://arxiv.org/html/2606.17567#bib.bib14)\], XGB \[[9](https://arxiv.org/html/2606.17567#bib.bib1)\], LGBM \[[19](https://arxiv.org/html/2606.17567#bib.bib2)\], CAT \[[25](https://arxiv.org/html/2606.17567#bib.bib15)\], and NGB \[[12](https://arxiv.org/html/2606.17567#bib.bib38)\]) that represent the major paradigms in ensemble learning (EL) and the current state of the art (SOTA) in gradient boosting. This allows for a thorough comparison, situating SCBoost’s performance within the broader landscape of ensemble methods. The specific parameter configurations for each baseline and SCBoost are provided in Appendix [F](https://arxiv.org/html/2606.17567#A6).

Table 1: Out-of-the-box performance for SCBoost compared with the SOTA ensemble algorithms. Note: For the OVA_Uterus dataset, the NGB results (#) were obtained with `natural_gradient=False` due to numerical instability in the official setting. All baselines marked with ∗ are significantly different from SCBoost according to the Wilcoxon signed-rank test, where ∗ denotes $p<0.05$ and ∗∗ denotes $p<0.01$.

Metrics and Protocol. Our experimental evaluation was designed to be comprehensive, assessing both the prediction performance and the underlying ensemble diversity of SCBoost against SOTA baselines. To rigorously measure performance, we employed a suite of standard classification metrics, including overall accuracy (ACC), the F1 score (F1), and Area Under the ROC Curve (AUC). To evaluate ensemble diversity, we employed three well-established diversity measures: Q-statistic (Q-s), Disagreement (Dis), and Ambiguity (Amb). To avoid data leakage and ensure reliable evaluation, we used stratified ten-fold cross-validation across all experiments with strict isolation of preprocessing. Crucially, for SCBoost, we employed a nested validation strategy to optimize the CRW weights: within each training fold, we performed an internal 80/20 split. The 80% subset was used for SRP projection and base learner training, while the held-out 20% subset was reserved strictly for covariance estimation and weight optimization. Finally, we reported the averaged results over ten-fold cross-validation. The best results are highlighted in bold, and the second-best results are underlined.
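The nested protocol described above can be sketched with plain index arithmetic. The helper names are hypothetical, and two details are assumptions because the text does not specify them: the inner 80/20 split is taken as a plain random split (not stratified), and the seeds are arbitrary.

```python
import numpy as np

def stratified_folds(y, n_splits=10, seed=0):
    """Assign each sample a fold id so class proportions are roughly preserved."""
    rng = np.random.default_rng(seed)
    fold = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        fold[idx] = np.arange(len(idx)) % n_splits
    return fold

def nested_splits(y, n_splits=10, inner_frac=0.8, seed=0):
    """Yield (inner_train, inner_val, test) index arrays for each outer fold."""
    fold = stratified_folds(y, n_splits, seed)
    rng = np.random.default_rng(seed + 1)
    for f in range(n_splits):
        test = np.flatnonzero(fold == f)
        train = rng.permutation(np.flatnonzero(fold != f))
        cut = int(inner_frac * len(train))
        # inner_train: SRP and base learners; inner_val: covariance and weights.
        yield train[:cut], train[cut:], test

# Usage on toy labels (60 negatives, 40 positives).
y = np.array([0] * 60 + [1] * 40)
splits = list(nested_splits(y))
inner_train, inner_val, test = splits[0]
print(len(inner_train), len(inner_val), len(test))
```

The test indices of each outer fold never appear in either inner subset, which is the isolation property the protocol relies on.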

Environmental Setting. All experiments were conducted locally using Jupyter Notebook in an Anaconda environment on Windows 10 (Version 10.0.19045). The hardware infrastructure utilized an Intel processor (Intel64 Family 6 Model 151 Stepping 2, GenuineIntel) equipped with 20 logical cores. The software stack was built on Python 3.12.4, utilizing the following key libraries: pandas 2.2.2, numpy 1.26.4, scikit-learn 1.7.2, statsmodels 0.14.2, imbalanced-learn 0.14.0, lightgbm 4.3.0, xgboost 2.0.3, catboost 1.2.3, and NGBoost 0.5.7.

### 3.2 Out-of-the-box Performance Comparison

Table [1](https://arxiv.org/html/2606.17567#S3.T1) reports the out-of-the-box performance of SCBoost and seven ensemble baselines on ten benchmark datasets. SCBoost achieved the best or tied-best ACC and F1 on nine datasets, with clear gains on Madelon, Jasmine, Bioresponse, Pol, and Elevators. The AUC results were more mixed: SCBoost performed best or tied for best on Madelon, Jasmine, Bioresponse, OVA_Uterus, and Elevators, but was not uniformly superior on Creditcard, QSAR, Scene, Letter, and Pol. Wilcoxon signed-rank tests indicated that many improvements were significant, while the differences were smaller or insignificant on datasets where several baselines already performed competitively. These results demonstrated the effectiveness of SCBoost in an out-of-the-box setting, where extensive hyperparameter tuning may be impractical due to limited computational resources.

Table 2: Performance of SCBoost (default) against tuned SOTA ensembles. Note: The NGB results (#) were obtained with `natural_gradient=False` due to numerical instability in the official setting. See Appendix [F](https://arxiv.org/html/2606.17567#A6) for details. All baselines marked with ∗ are significantly different from SCBoost according to the Wilcoxon signed-rank test, where ∗ denotes $p<0.05$ and ∗∗ denotes $p<0.01$.

Table 3: Diversity metrics of ensemble methods across benchmark datasets. Note: When SCBoost and a baseline yield identical fold-level values, all paired differences are zero and the Wilcoxon signed-rank test is not applicable (#). All baselines marked with ∗ are significantly different from SCBoost according to the Wilcoxon signed-rank test, where ∗ denotes $p<0.05$, ∗∗ denotes $p<0.01$, and ∗∗∗ denotes $p<0.001$.

### 3.3 Comparison with Tuned SOTA Baselines

Table [2](https://arxiv.org/html/2606.17567#S3.T2) presents the comparison between SCBoost with fixed default parameters and seven tuned ensemble baselines on five benchmark datasets. SCBoost achieved the best ACC and F1 on all five datasets, and obtained the best AUC on four of them. The gains were clear on Madelon, Jasmine, and Bioresponse, where SCBoost improved over the strongest tuned baseline by a large margin in ACC and F1. On Creditcard, SCBoost achieved the best ACC and F1, while its AUC was lower than that of LGBM and XGB. On OVA_Uterus, SCBoost obtained the best ACC, F1, and AUC, although the differences in AUC were small. Although tuning improved most baselines, SCBoost still achieved the strongest overall ACC and F1 without dataset-specific tuning, suggesting that its gains cannot simply be reproduced by hyperparameter optimization of the baselines.

### 3.4 Ensemble Diversity Analysis

Table [3](https://arxiv.org/html/2606.17567#S3.T3) summarizes the diversity metrics on three benchmark datasets. SCBoost obtained higher Dis and Amb scores than the baselines, while its Q-s values were close to zero. These results were consistent with the empirical effect of SRP: the residual target was projected onto the orthogonal complement of the selected historical prediction subspace before training the next learner, which reduced components already represented by previous learners. This does not guarantee exact orthogonality of the fitted learners, since the base learner may only approximate the projected target. The higher Amb scores were also consistent with the role of CRW, which penalized validation-set covariance during weight optimization. Beyond these statistical metrics, SCBoost exhibited a flatter singular value spectrum in its prediction history than GBDT, as shown in Appendix [C](https://arxiv.org/html/2606.17567#A3). This slower spectral decay provided geometric evidence consistent with SRP pushing learners to span a broader, less redundant prediction subspace. Overall, these results indicate that SRP and CRW reduced empirical learner redundancy under the reported diversity metrics.

![Refer to caption](https://arxiv.org/html/2606.17567v1/x1.png)

Figure 2: Robustness evaluation of SCBoost and baselines under increasing levels (10%-30%) of label noise on the Madelon dataset.

![Refer to caption](https://arxiv.org/html/2606.17567v1/x2.png)

Figure 3: Robustness evaluation of SCBoost and baselines under increasing levels (10%-30%) of label noise on the Jasmine dataset.

### 3.5 Robustness Analysis

Figures [2](https://arxiv.org/html/2606.17567#S3.F2) and [3](https://arxiv.org/html/2606.17567#S3.F3) report the robustness results under 10%-30% label-flip noise on Madelon and Jasmine. SCBoost maintained competitive performance as the noise ratio increased, while several baselines degraded more. These results were consistent with the motivation of SRP: instead of fitting the full residual $\mathbf{r}^{(t)}$, SCBoost trained each new learner on the projected target $\tilde{\mathbf{r}}^{(t)}$, which removed the component lying in the selected historical prediction subspace. The fixed-subspace analysis in Proposition [2.2](https://arxiv.org/html/2606.17567#S2.Thmtheorem2) showed that such a projection reduced the expected energy of isotropic noise under explicit independence assumptions, and improved SNR only when the removed signal fraction was smaller than the removed noise fraction. Since the robustness experiments used label-flip noise rather than additive Gaussian noise, these results should be interpreted as empirical evidence of noise robustness, not as a direct validation of the fixed-subspace SNR analysis.
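
To make the noise protocol concrete, the following is a minimal sketch of label-flip noise for binary labels. It assumes symmetric flipping of a fixed fraction of labels chosen uniformly at random; the exact injection procedure used in the experiments is not specified in this section, and the helper name `flip_binary_labels` is hypothetical.

```python
import numpy as np

def flip_binary_labels(y, noise_ratio, seed=0):
    """Return a copy of binary labels y (values 0/1) with a fraction flipped.

    Assumption: symmetric flipping of uniformly chosen samples; this is an
    illustrative protocol, not necessarily the one used in the paper.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    n_flip = int(round(noise_ratio * y.size))
    # Choose distinct indices uniformly at random, then invert those labels.
    idx = rng.choice(y.size, size=n_flip, replace=False)
    y[idx] = 1 - y[idx]
    return y

y_clean = np.array([0, 1] * 50)          # 100 toy labels
for ratio in (0.1, 0.2, 0.3):
    y_noisy = flip_binary_labels(y_clean, ratio)
    # Number of labels that differ from the clean vector.
    print(ratio, int((y_noisy != y_clean).sum()))
```

In a robustness study of this kind, only the training labels would typically be corrupted, while evaluation uses the clean test labels.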

![Refer to caption](https://arxiv.org/html/2606.17567v1/x3.png)

Figure 4: Learning rate sensitivity analysis of SCBoost on Jasmine, Madelon, and Bioresponse. We evaluated $\eta\in\{0.001,0.01,0.05,0.1,0.5\}$ and reported ACC, F1, and AUC for each dataset.

Table 4: Ablation of the SCBoost algorithm on three datasets.

### 3.6 Hyperparameter Sensitivity Analysis

Figure [4](https://arxiv.org/html/2606.17567#S3.F4) reports the sensitivity of SCBoost to the learning rate $\eta$ on Jasmine, Madelon, and Bioresponse. We evaluated $\eta\in\{0.001,0.01,0.05,0.1,0.5\}$ using ACC, F1, and AUC. Overall, SCBoost remained stable across different learning rates, especially in terms of AUC. On all three datasets, larger learning rates generally improved ACC and F1, with $\eta=0.5$ giving the best or near-best results. However, $\eta=0.1$ led to a small performance drop on Madelon and Bioresponse, suggesting that the effect of $\eta$ was not strictly monotonic. These results indicated that SCBoost was not highly sensitive to the choice of learning rate and that a moderately large learning rate could improve empirical performance under the default setting. Additionally, Appendix [D](https://arxiv.org/html/2606.17567#A4) showed that shallow trees underfit orthogonal innovations, while deep trees ($D>5$) overfit without improving the redundancy ratio. SCBoost therefore appears to work best with moderate-depth learners.

### 3.7 Ablation

Table [4](https://arxiv.org/html/2606.17567#S3.T4) presents the ablation results on Madelon, Jasmine, and Bioresponse. SRP alone outperformed CRW alone, suggesting that modifying the residual target was more influential than only reweighting the trained learners. Combining SRP and CRW achieved the best performance on all three datasets. For example, on Bioresponse, SRP+CRW improved ACC by more than 7% compared with either component alone. These results were consistent with the roles of the two components: SRP removed the selected historical component from the residual target as described in Proposition [2.1](https://arxiv.org/html/2606.17567#S2.Thmtheorem1), while CRW reduced the influence of highly correlated learners through validation-set covariance regularization. Overall, the ablation results indicated that both target projection and covariance-aware aggregation contributed to the final performance.

## 4 Conclusion

In this work, we re-examined learner redundancy in gradient boosting and proposed a shift from traditional residual fitting to explicit residual orthogonalization. Our proposed framework, SCBoost, achieves this through a dual mechanism: purifying the learning target via Spectral Residual Projection (SRP) and penalizing ensemble correlation via Covariance-Regularized Weighting (CRW). We theoretically characterized SRP's ability to isolate empirical innovations and, under an isotropic-noise assumption, the conditions under which it improves the effective signal-to-noise ratio. This analysis was consistent with SCBoost's strong default performance across multiple benchmark datasets. Although we observed some metric-specific variations that depend on data characteristics, our findings support the value of geometric target modification in ensemble learning. Future work will focus on designing more computationally scalable projection techniques, enabling this non-redundant boosting paradigm to be deployed in larger-scale industrial applications.

## Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 62403043\. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper\.

## References

- \[1\] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama (2019). Optuna: a next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631.
- \[2\] P. Bartlett, Y. Freund, W. S. Lee, and R. E. Schapire (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics 26(5), pp. 1651–1686.
- \[3\] L. Breiman (1996). Bagging predictors. Machine Learning 24(2), pp. 123–140. DOI: [10.1007/BF00058655](https://dx.doi.org/10.1007/BF00058655).
- \[4\] L. Breiman (2001). Random forests. Machine Learning 45(1), pp. 5–32.
- \[5\] P. Bühlmann and T. Hothorn (2007). Boosting algorithms: regularization, prediction and model fitting.
- \[6\] P. Bühlmann and B. Yu (2002). Analyzing bagging. The Annals of Statistics 30(4), pp. 927–961.
- \[7\] H. Chen, A. G. Cohn, and X. Yao (2012). Ensemble learning by negative correlation learning. In Ensemble Machine Learning, pp. 177–201.
- \[8\] H. Chen and X. Yao (2009). Regularized negative correlation learning for neural network ensembles. IEEE Transactions on Neural Networks 20(12), pp. 1962–1979.
- \[9\] T. Chen and C. Guestrin (2016). XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. DOI: [10.1145/2939672.2939785](https://dx.doi.org/10.1145/2939672.2939785).
- \[10\] A. Dal Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi (2015). Calibrating probability with undersampling for unbalanced classification. In 2015 IEEE Symposium Series on Computational Intelligence, pp. 159–166.
- \[11\] T. G. Dietterich (2000). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning 40(2), pp. 139–157.
- \[12\] T. Duan, A. Anand, D. Y. Ding, K. K. Thai, S. Basu, A. Ng, and A. Schuler (2020). NGBoost: natural gradient boosting for probabilistic prediction. In International Conference on Machine Learning, pp. 2690–2700.
- \[13\] Y. Freund and R. E. Schapire (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), pp. 119–139.
- \[14\] J. H. Friedman (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics 29(5), pp. 1189–1232. DOI: [10.1214/aos/1013203451](https://dx.doi.org/10.1214/aos/1013203451).
- \[15\] P. Geurts, D. Ernst, and L. Wehenkel (2006). Extremely randomized trees. Machine Learning 63(1), pp. 3–42.
- \[16\] S. Gu, Y. Hou, L. Zhang, and Y. Zhang (2018). Regularizing deep neural networks with an ensemble-based decorrelation method. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Vol. 1, pp. 2177–2183.
- \[17\] I. Guyon, J. Li, T. Mader, P. A. Pletscher, G. Schneider, and M. Uhr (2007). Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recognition Letters 28(12), pp. 1438–1444.
- \[18\] T. Hasanin and T. Khoshgoftaar (2018). The effects of random undersampling with simulated class imbalance for big data. In 2018 IEEE International Conference on Information Reuse and Integration (IRI), pp. 70–79.
- \[19\] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, and T. Liu (2017). LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30.
- \[20\] S. Kotsiantis (2011). Combining bagging, boosting, rotation forest and random subspace methods. Artificial Intelligence Review 35(3), pp. 223–240.
- \[21\] B. Liu and G. Tsoumakas (2020). Dealing with class imbalance in classifier chains via random undersampling. Knowledge-Based Systems 192, pp. 105292.
- \[22\] Y. Liu, X. Yao, and T. Higuchi (2000). Evolutionary ensembles with negative correlation learning. IEEE Transactions on Evolutionary Computation 4(4), pp. 380–387.
- \[23\] Y. Liu and X. Yao (1999). Ensemble learning via negative correlation. Neural Networks 12(10), pp. 1399–1404.
- \[24\] A. Mayr, H. Binder, O. Gefeller, and M. Schmid (2014). The evolution of boosting algorithms. Methods of Information in Medicine 53(06), pp. 419–427.
- \[25\] L. Prokhorenkova, G. Gusev, A. Vorobev, and N. Dvornik (2018). CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems 31.
- \[26\] R. E. Schapire (1999). A brief introduction to boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI), Vol. 99, pp. 1401–1406.
- \[27\] G. Stiglic and P. Kokol (2010). Stability of ranked gene lists in large microarray analysis studies. BioMed Research International 2010(1), pp. 616358.
- \[28\] Y. Su, J. Li, and Y. Liu (2026). When does $\ell_{2}$-boosting overfit benignly? High-dimensional risk asymptotics and the $\ell_{1}$ implicit bias. arXiv preprint arXiv:2605.06314.
- \[29\] Y. Su, L. Zhao, D. Garcia-Gil, J. Guo, G. Zhang, J. Chen, and J. Chen (2026). ITBoost: information-theoretic trust for robust boosting. arXiv preprint arXiv:2605.04671.
- \[30\] S. Wang, H. Chen, and X. Yao (2010). Negative correlation learning for classification ensembles. In The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
- \[31\] Y. Wen, D. Tran, and J. Ba (2020). BatchEnsemble: an alternative approach to efficient ensemble and lifelong learning. arXiv preprint arXiv:2002.06715.
- \[32\] B. Xie, Y. Liang, and L. Song (2017). Diverse neural network learns true target functions. In Artificial Intelligence and Statistics, pp. 1216–1224.
- \[33\] J. Yu (2019). A selective deep stacked denoising autoencoders ensemble with negative correlation learning for gearbox fault diagnosis. Computers in Industry 108, pp. 62–72.
- \[34\] C. Zhang, C. Guo, X. Wang, S. Ding, F. Wu, and D. Zhang (2025). Robust decorrelated stochastic configuration networks ensemble via weighted negative correlation learning. IEEE Transactions on Systems, Man, and Cybernetics: Systems.

## Appendix A Proof of Proposition [2.1](https://arxiv.org/html/2606.17567#S2.Thmtheorem1)

###### Proof\.

Since $\mathbf{U}_{k}^{\top}\mathbf{U}_{k}=\mathbf{I}_{k}$, we have

$$
\begin{aligned}
\mathbf{P}_{k}^{2} &= (\mathbf{U}_{k}\mathbf{U}_{k}^{\top})(\mathbf{U}_{k}\mathbf{U}_{k}^{\top}) \\
&= \mathbf{U}_{k}(\mathbf{U}_{k}^{\top}\mathbf{U}_{k})\mathbf{U}_{k}^{\top} \\
&= \mathbf{U}_{k}\mathbf{I}_{k}\mathbf{U}_{k}^{\top} = \mathbf{P}_{k},
\end{aligned}
$$

and

$$
\mathbf{P}_{k}^{\top} = (\mathbf{U}_{k}\mathbf{U}_{k}^{\top})^{\top} = \mathbf{U}_{k}\mathbf{U}_{k}^{\top} = \mathbf{P}_{k}.
$$

Thus $\mathbf{P}_{k}$ is the orthogonal projector onto $\mathcal{H}_{t-1}=\operatorname{span}(\mathbf{U}_{k})$. Moreover,

$$
\begin{aligned}
\mathbf{Q}_{k}^{2} &= (\mathbf{I}_{n}-\mathbf{P}_{k})^{2} \\
&= \mathbf{I}_{n}-2\mathbf{P}_{k}+\mathbf{P}_{k}^{2} \\
&= \mathbf{I}_{n}-\mathbf{P}_{k} = \mathbf{Q}_{k},
\end{aligned}
$$

and

$$
\mathbf{Q}_{k}^{\top} = (\mathbf{I}_{n}-\mathbf{P}_{k})^{\top} = \mathbf{I}_{n}-\mathbf{P}_{k}^{\top} = \mathbf{I}_{n}-\mathbf{P}_{k} = \mathbf{Q}_{k}.
$$

Hence $\mathbf{Q}_{k}$ is the orthogonal projector onto $\mathcal{H}_{t-1}^{\perp}$.

For any feasible $\mathbf{z}$, $\mathbf{z}\in\mathcal{H}_{t-1}^{\perp}$, so $\mathbf{Q}_{k}\mathbf{z}=\mathbf{z}$ and $\mathbf{P}_{k}\mathbf{z}=\mathbf{0}$. Since

$$
\mathbf{r}^{(t)} = \mathbf{P}_{k}\mathbf{r}^{(t)}+\mathbf{Q}_{k}\mathbf{r}^{(t)},
$$

we obtain

$$
\begin{aligned}
\|\mathbf{z}-\mathbf{r}^{(t)}\|_{2}^{2} &= \|\mathbf{z}-\mathbf{P}_{k}\mathbf{r}^{(t)}-\mathbf{Q}_{k}\mathbf{r}^{(t)}\|_{2}^{2} \\
&= \|(\mathbf{z}-\mathbf{Q}_{k}\mathbf{r}^{(t)})-\mathbf{P}_{k}\mathbf{r}^{(t)}\|_{2}^{2}.
\end{aligned}
$$

Because $\mathbf{z}-\mathbf{Q}_{k}\mathbf{r}^{(t)}\in\mathcal{H}_{t-1}^{\perp}$ and $\mathbf{P}_{k}\mathbf{r}^{(t)}\in\mathcal{H}_{t-1}$,

$$
\langle\mathbf{z}-\mathbf{Q}_{k}\mathbf{r}^{(t)},\mathbf{P}_{k}\mathbf{r}^{(t)}\rangle = 0.
$$

Therefore,

$$
\|\mathbf{z}-\mathbf{r}^{(t)}\|_{2}^{2} = \|\mathbf{z}-\mathbf{Q}_{k}\mathbf{r}^{(t)}\|_{2}^{2}+\|\mathbf{P}_{k}\mathbf{r}^{(t)}\|_{2}^{2}.
$$

The second term is independent of $\mathbf{z}$, and the first term is uniquely minimized at $\mathbf{z}=\mathbf{Q}_{k}\mathbf{r}^{(t)}$. Hence

$$
\tilde{\mathbf{r}}^{(t)} = \mathbf{Q}_{k}\mathbf{r}^{(t)} = \mathop{\arg\min}_{\mathbf{z}\in\mathbb{R}^{n}}\|\mathbf{z}-\mathbf{r}^{(t)}\|_{2}^{2}
\quad\text{s.t.}\quad \langle\mathbf{z},\mathbf{u}\rangle=0,\ \forall\mathbf{u}\in\mathcal{H}_{t-1}.
$$

Next,

$$
\begin{aligned}
\|\mathbf{r}^{(t)}\|_{2}^{2} &= \|\mathbf{P}_{k}\mathbf{r}^{(t)}+\mathbf{Q}_{k}\mathbf{r}^{(t)}\|_{2}^{2} \\
&= \|\mathbf{P}_{k}\mathbf{r}^{(t)}\|_{2}^{2}+\|\mathbf{Q}_{k}\mathbf{r}^{(t)}\|_{2}^{2}+2\langle\mathbf{P}_{k}\mathbf{r}^{(t)},\mathbf{Q}_{k}\mathbf{r}^{(t)}\rangle.
\end{aligned}
$$

The cross term satisfies

$$
\begin{aligned}
\langle\mathbf{P}_{k}\mathbf{r}^{(t)},\mathbf{Q}_{k}\mathbf{r}^{(t)}\rangle &= (\mathbf{r}^{(t)})^{\top}\mathbf{P}_{k}^{\top}\mathbf{Q}_{k}\mathbf{r}^{(t)} \\
&= (\mathbf{r}^{(t)})^{\top}\mathbf{P}_{k}(\mathbf{I}_{n}-\mathbf{P}_{k})\mathbf{r}^{(t)} \\
&= (\mathbf{r}^{(t)})^{\top}(\mathbf{P}_{k}-\mathbf{P}_{k}^{2})\mathbf{r}^{(t)} = 0.
\end{aligned}
$$

Thus

$$
\|\mathbf{r}^{(t)}\|_{2}^{2} = \|\mathbf{P}_{k}\mathbf{r}^{(t)}\|_{2}^{2}+\|\mathbf{Q}_{k}\mathbf{r}^{(t)}\|_{2}^{2}.
$$

Since $\mathbf{P}_{k}\mathbf{r}^{(t)}=\sum_{i=1}^{k}(\mathbf{u}_{i}^{\top}\mathbf{r}^{(t)})\mathbf{u}_{i}$ and $\{\mathbf{u}_{i}\}_{i=1}^{k}$ is orthonormal,

$$
\|\mathbf{P}_{k}\mathbf{r}^{(t)}\|_{2}^{2} = \left\|\sum_{i=1}^{k}(\mathbf{u}_{i}^{\top}\mathbf{r}^{(t)})\mathbf{u}_{i}\right\|_{2}^{2} = \sum_{i=1}^{k}(\mathbf{u}_{i}^{\top}\mathbf{r}^{(t)})^{2}.
$$

Therefore,

$$
\begin{aligned}
\|\tilde{\mathbf{r}}^{(t)}\|_{2}^{2} &= \|\mathbf{Q}_{k}\mathbf{r}^{(t)}\|_{2}^{2} \\
&= \|\mathbf{r}^{(t)}\|_{2}^{2}-\|\mathbf{P}_{k}\mathbf{r}^{(t)}\|_{2}^{2} \\
&= \|\mathbf{r}^{(t)}\|_{2}^{2}-\sum_{i=1}^{k}(\mathbf{u}_{i}^{\top}\mathbf{r}^{(t)})^{2}.
\end{aligned}
$$

∎
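
The identities in this proof can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the basis `U` is a random orthonormal basis obtained by QR (standing in for the left singular vectors selected by SRP), and `r` is a random stand-in for the residual $\mathbf{r}^{(t)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 5

# Orthonormal basis U_k of a k-dimensional "history" subspace (reduced QR).
# Assumption: any orthonormal basis works for the check; SRP would obtain
# it from an SVD of the prediction history instead.
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
P = U @ U.T               # projector onto span(U_k)
Q = np.eye(n) - P         # projector onto the orthogonal complement

r = rng.standard_normal(n)    # stand-in residual r^(t)
r_tilde = Q @ r               # projected target

# Both projectors are idempotent and symmetric.
assert np.allclose(P @ P, P) and np.allclose(P, P.T)
assert np.allclose(Q @ Q, Q) and np.allclose(Q, Q.T)

# The projected target is orthogonal to every basis vector u_i.
assert np.allclose(U.T @ r_tilde, 0.0)

# Energy decomposition: ||r~||^2 = ||r||^2 - sum_i (u_i^T r)^2.
lhs = r_tilde @ r_tilde
rhs = r @ r - np.sum((U.T @ r) ** 2)
assert np.isclose(lhs, rhs)

# Any other feasible z (orthogonal to the subspace) is no closer to r.
z = Q @ rng.standard_normal(n)
assert np.sum((z - r) ** 2) >= np.sum((r_tilde - r) ** 2)
```

In practice one would apply `r - U @ (U.T @ r)` rather than forming the $n\times n$ matrix `Q`; the explicit matrices are used here only to mirror the notation of the proof.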

## Appendix B Proof of Proposition [2.2](https://arxiv.org/html/2606.17567#S2.Thmtheorem2)

###### Proof\.

Since $\mathbf{P}_{k}$ is a fixed rank-$k$ orthogonal projector, $\mathbf{P}_{k}^{\top}=\mathbf{P}_{k}$, $\mathbf{P}_{k}^{2}=\mathbf{P}_{k}$, and $\operatorname{rank}(\mathbf{P}_{k})=k$. Hence $\mathbf{Q}_{k}=\mathbf{I}_{n}-\mathbf{P}_{k}$ satisfies

$$
\mathbf{Q}_{k}^{\top}=\mathbf{Q}_{k},\qquad \mathbf{Q}_{k}^{2}=\mathbf{Q}_{k},\qquad \operatorname{rank}(\mathbf{Q}_{k})=n-k=d.
$$

Therefore, there exists an orthogonal matrix $\mathbf{R}\in\mathbb{R}^{n\times n}$ such that

$$
\mathbf{Q}_{k}=\mathbf{R}\begin{bmatrix}\mathbf{I}_{d}&\mathbf{0}\\ \mathbf{0}&\mathbf{0}\end{bmatrix}\mathbf{R}^{\top}.
$$

Let $\mathbf{g}=\nu^{-1}\mathbf{R}^{\top}\boldsymbol{\epsilon}$. Since $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\nu^{2}\mathbf{I}_{n})$ and $\mathbf{R}$ is orthogonal, $\mathbf{g}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n})$. Thus

$$
\begin{aligned}
\|\mathbf{Q}_{k}\boldsymbol{\epsilon}\|_{2}^{2} &= \boldsymbol{\epsilon}^{\top}\mathbf{Q}_{k}^{\top}\mathbf{Q}_{k}\boldsymbol{\epsilon}=\boldsymbol{\epsilon}^{\top}\mathbf{Q}_{k}\boldsymbol{\epsilon} \\
&= \nu^{2}\mathbf{g}^{\top}\begin{bmatrix}\mathbf{I}_{d}&\mathbf{0}\\ \mathbf{0}&\mathbf{0}\end{bmatrix}\mathbf{g} \\
&= \nu^{2}\sum_{i=1}^{d}g_{i}^{2}.
\end{aligned}
$$

Since $g_{i}\overset{i.i.d.}{\sim}\mathcal{N}(0,1)$, $\sum_{i=1}^{d}g_{i}^{2}\sim\chi_{d}^{2}$. Therefore,

$$
\mathbb{E}\|\mathbf{Q}_{k}\boldsymbol{\epsilon}\|_{2}^{2} = \nu^{2}\mathbb{E}\left[\sum_{i=1}^{d}g_{i}^{2}\right] = \nu^{2}\sum_{i=1}^{d}\mathbb{E}[g_{i}^{2}] = d\nu^{2}.
$$

Also,

$$
\mathbb{E}\|\boldsymbol{\epsilon}\|_{2}^{2} = \sum_{i=1}^{n}\mathbb{E}[\epsilon_{i}^{2}] = n\nu^{2}.
$$

Hence

$$
\begin{aligned}
\mathbb{E}\|\mathbf{Q}_{k}\boldsymbol{\epsilon}\|_{2}^{2} &= d\nu^{2}=(n-k)\nu^{2} \\
&= \left(1-\frac{k}{n}\right)n\nu^{2} \\
&= \left(1-\frac{k}{n}\right)\mathbb{E}\|\boldsymbol{\epsilon}\|_{2}^{2}.
\end{aligned}
$$

Let $X=\sum_{i=1}^{d}g_{i}^{2}\sim\chi_{d}^{2}$. For any $x>0$,

$$
\mathbb{P}\left(X\geq d+2\sqrt{dx}+2x\right)\leq e^{-x}.
$$

Taking $x=\log(1/\delta)$ gives

$$
\mathbb{P}\left(X\leq d+2\sqrt{d\log(1/\delta)}+2\log(1/\delta)\right)\geq 1-\delta.
$$

Since $\|\mathbf{Q}_{k}\boldsymbol{\epsilon}\|_{2}^{2}=\nu^{2}X$, with probability at least $1-\delta$,

$$
\|\mathbf{Q}_{k}\boldsymbol{\epsilon}\|_{2}^{2}\leq\nu^{2}\left[d+2\sqrt{d\log(1/\delta)}+2\log(1/\delta)\right].
$$

Next, since $\mathbf{r}=\mathbf{s}+\boldsymbol{\epsilon}$,

$$
\mathbf{Q}_{k}\mathbf{r}=\mathbf{Q}_{k}\mathbf{s}+\mathbf{Q}_{k}\boldsymbol{\epsilon}.
$$

Let $\mathbf{a}=\mathbf{Q}_{k}\mathbf{s}$ and $\mathbf{b}=\mathbf{Q}_{k}\boldsymbol{\epsilon}$. Then

$$
\|\mathbf{Q}_{k}\mathbf{r}\|_{2}^{2}=\|\mathbf{a}+\mathbf{b}\|_{2}^{2}=\|\mathbf{a}\|_{2}^{2}+2\langle\mathbf{a},\mathbf{b}\rangle+\|\mathbf{b}\|_{2}^{2}.
$$

For any $\eta>0$,

$$
0\leq\|\sqrt{\eta}\,\mathbf{a}-\eta^{-1/2}\mathbf{b}\|_{2}^{2}=\eta\|\mathbf{a}\|_{2}^{2}-2\langle\mathbf{a},\mathbf{b}\rangle+\eta^{-1}\|\mathbf{b}\|_{2}^{2}.
$$

Thus

$$
2\langle\mathbf{a},\mathbf{b}\rangle\leq\eta\|\mathbf{a}\|_{2}^{2}+\eta^{-1}\|\mathbf{b}\|_{2}^{2}.
$$

Therefore,

$$
\begin{aligned}
\|\mathbf{Q}_{k}\mathbf{r}\|_{2}^{2} &\leq (1+\eta)\|\mathbf{a}\|_{2}^{2}+(1+\eta^{-1})\|\mathbf{b}\|_{2}^{2} \\
&= (1+\eta)\|\mathbf{Q}_{k}\mathbf{s}\|_{2}^{2}+(1+\eta^{-1})\|\mathbf{Q}_{k}\boldsymbol{\epsilon}\|_{2}^{2}.
\end{aligned}
$$

Finally, assume $0\leq k<n$ and $\mathbf{s}\neq\mathbf{0}$. Since $\mathbf{P}_{k}$ and $\mathbf{Q}_{k}$ are orthogonal complementary projectors,

$$
\|\mathbf{s}\|_{2}^{2}=\|\mathbf{P}_{k}\mathbf{s}\|_{2}^{2}+\|\mathbf{Q}_{k}\mathbf{s}\|_{2}^{2}.
$$

By $\rho=\|\mathbf{P}_{k}\mathbf{s}\|_{2}^{2}/\|\mathbf{s}\|_{2}^{2}$, we have

$$
\|\mathbf{Q}_{k}\mathbf{s}\|_{2}^{2}=\|\mathbf{s}\|_{2}^{2}-\|\mathbf{P}_{k}\mathbf{s}\|_{2}^{2}=(1-\rho)\|\mathbf{s}\|_{2}^{2}.
$$

Therefore,

$$
\begin{aligned}
\frac{\operatorname{SNR}(\mathbf{Q}_{k}\mathbf{r})}{\operatorname{SNR}(\mathbf{r})} &= \frac{\|\mathbf{Q}_{k}\mathbf{s}\|_{2}^{2}/((n-k)\nu^{2})}{\|\mathbf{s}\|_{2}^{2}/(n\nu^{2})} \\
&= \frac{n\|\mathbf{Q}_{k}\mathbf{s}\|_{2}^{2}}{(n-k)\|\mathbf{s}\|_{2}^{2}} \\
&= \frac{n(1-\rho)\|\mathbf{s}\|_{2}^{2}}{(n-k)\|\mathbf{s}\|_{2}^{2}}=\frac{1-\rho}{1-k/n}.
\end{aligned}
$$

Thus

$$
\operatorname{SNR}(\mathbf{Q}_{k}\mathbf{r})>\operatorname{SNR}(\mathbf{r})
\quad\Longleftrightarrow\quad \frac{1-\rho}{1-k/n}>1
\quad\Longleftrightarrow\quad 1-\rho>1-\frac{k}{n}
\quad\Longleftrightarrow\quad \rho<\frac{k}{n}.
$$

∎
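
The two main quantities in this proof, the expected projected noise energy $(n-k)\nu^{2}$ and the SNR ratio $(1-\rho)/(1-k/n)$, can be illustrated with a small simulation. This is a sketch under the proposition's fixed-subspace, isotropic-noise assumptions; the sizes `n`, `k`, `nu`, the random basis `U`, and the helper `snr_ratio` are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, nu = 200, 20, 0.5

# Fixed rank-k projector P and its complement Q (random orthonormal basis).
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
P = U @ U.T
Q = np.eye(n) - P

# Expected noise energy after projection: nu^2 * trace(Q) = (n - k) * nu^2.
expected_noise = nu**2 * np.trace(Q)

# Monte Carlo estimate of E||Q eps||^2 with eps ~ N(0, nu^2 I_n).
eps = nu * rng.standard_normal((5000, n))
mc_noise = np.mean(np.sum((eps @ Q) ** 2, axis=1))  # Q symmetric: rows are Q eps

def snr_ratio(rho):
    """SNR(Q r) / SNR(r) for a unit-norm signal with energy fraction rho in span(U)."""
    s_in = U[:, 0]                       # unit vector inside the subspace
    v = Q @ rng.standard_normal(n)
    s_out = v / np.linalg.norm(v)        # unit vector in the complement
    s = np.sqrt(rho) * s_in + np.sqrt(1.0 - rho) * s_out
    snr_before = (s @ s) / (n * nu**2)
    snr_after = np.sum((Q @ s) ** 2) / ((n - k) * nu**2)
    return snr_after / snr_before

# Here k/n = 0.1: rho below 0.1 should raise the SNR, rho above should lower it.
low, high = snr_ratio(0.05), snr_ratio(0.30)
assert np.isclose(low, (1 - 0.05) / (1 - k / n)) and low > 1.0
assert np.isclose(high, (1 - 0.30) / (1 - k / n)) and high < 1.0
```

The simulation only exercises the idealized setting of the proposition, in which the subspace is fixed and independent of the noise; it says nothing about data-dependent subspaces or label-flip noise.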

## Appendix C Spectral Evidence of Prediction Diversity

To examine the structure of the learned ensemble, Figure [5](https://arxiv.org/html/2606.17567#A3.F5) compares the normalized singular value spectra of the prediction history matrices for SCBoost and GBDT on the Jasmine dataset. GBDT showed a faster spectral decay, indicating that its prediction history was more concentrated in the leading singular directions. In contrast, SCBoost produced a flatter average spectrum across 10-fold cross-validation, suggesting that its learners covered a broader empirical prediction subspace. This observation was consistent with the role of SRP, which removes the selected historical component from the residual target before fitting the next learner. The result should be interpreted as empirical evidence of reduced prediction redundancy, not as a direct validation of the fixed-subspace SNR analysis.
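
As an illustration of this diagnostic, the sketch below computes a normalized singular value spectrum for two synthetic prediction histories. It assumes the spectrum is normalized by the largest singular value (the normalization used for the figure is not stated in this section), and both matrices are toy data rather than SCBoost or GBDT outputs.

```python
import numpy as np

def normalized_spectrum(H):
    """Singular values of H (n samples x T learners), scaled so the largest is 1.

    Assumption: normalization by the leading singular value.
    """
    s = np.linalg.svd(H, compute_uv=False)   # returned in descending order
    return s / s[0]

rng = np.random.default_rng(0)
n, T = 300, 10

# Redundant history: every column is a noisy copy of one shared direction.
base = rng.standard_normal((n, 1))
H_redundant = base + 0.05 * rng.standard_normal((n, T))

# Diverse history: columns are orthonormal, so no direction dominates.
H_diverse, _ = np.linalg.qr(rng.standard_normal((n, T)))

spec_redundant = normalized_spectrum(H_redundant)
spec_diverse = normalized_spectrum(H_diverse)
print(np.round(spec_redundant[:3], 3))   # fast decay after the first value
print(np.round(spec_diverse[:3], 3))     # flat spectrum
```

A fast drop after the first singular value indicates that most columns repeat one direction, whereas a flat spectrum indicates that the columns spread their energy across many directions.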

![Refer to caption](https://arxiv.org/html/2606.17567v1/x4.png)

Figure 5:Normalized singular value spectra of prediction history matrices for SCBoost and GBDT on the Jasmine dataset, averaged over 10\-fold cross\-validation\.
## Appendix D Effect of Base Learner Capacity

We further evaluated the effect of base learner depth on SCBoost using the Jasmine dataset. The energy threshold was fixed to $\alpha=0.9$, and the tree depth was varied over $D\in[1,8]$. Figure [6](https://arxiv.org/html/2606.17567#A4.F6) reports test accuracy and the redundancy ratio. Accuracy increased from shallow trees to moderate depths and reached its best value at $D=5$. The redundancy ratio decreased over the same range, suggesting that moderate-capacity learners fitted the projected residual targets better. When the depth was increased further, accuracy did not continue to improve. These results indicated that SCBoost benefited from sufficient learner capacity, but deeper trees did not necessarily provide better generalization.

![Refer to caption](https://arxiv.org/html/2606.17567v1/x5.png)

Figure 6: Effect of base learner depth on SCBoost on the Jasmine dataset. We reported test accuracy and redundancy ratio for tree depths $D\in[1,8]$ with $\alpha=0.9$.

## Appendix E Time Complexity Analysis of SCBoost

Understanding the computational characteristics of SCBoost is essential for assessing its scalability. The time complexity is composed of three main components across $T$ iterations:

- Weak Learner Training: Training a regression tree of depth $D$ at each iteration takes $O(nd\log n)$, where $d$ is the number of features. Over $T$ iterations, this sums to $O(Tnd\log n)$.
- SRP: At iteration $t$, SRP performs a truncated SVD on the prediction history matrix $\mathbf{H}^{(t-1)}\in\mathbb{R}^{n\times(t-1)}$. Note that while our implementation stores predictions instead of residuals, the dimensions remain identical to the residual history case. Thus, the computational cost remains $O(n(t-1)^{2})$ per iteration. The cumulative cost over $T$ iterations is

  $$\sum_{t=1}^{T}O(nt^{2})=n\sum_{t=1}^{T}t^{2}\approx O(nT^{3}). \tag{2}$$

- CRW: Solving the convex optimization problem for the final weights involves the covariance matrix of size $T\times T$, requiring $O(nT^{2}+T^{3})$.

Combining these components, the overall time complexity of SCBoost is dominated by the SRP step:

$$O(nT^{3} + Tnd\log n). \tag{3}$$

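
As a rough check on the cubic term, the cumulative SRP cost can be evaluated in closed form. The calculation below assumes the constant hidden in the per-iteration SVD cost is 1 and uses the $n(t-1)^{2}$ form of that cost, so it illustrates the growth rate rather than a measured operation count.

$$
n\sum_{t=1}^{T}(t-1)^{2} = n\sum_{k=0}^{T-1}k^{2} = \frac{n\,(T-1)\,T\,(2T-1)}{6} \approx \frac{nT^{3}}{3}.
$$

Assuming $T = 100$, matching the n\_estimators=100 setting in Appendix F, the sum of squares is $99 \cdot 100 \cdot 199 / 6 = 328{,}350$, so the SRP term is about $3.3 \times 10^{5}\,n$ under these assumptions.
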
To evaluate practical implications, Table [5](https://arxiv.org/html/2606.17567#A5.T5) reports the wall-clock training time of SCBoost against several baselines on all five datasets, using the same computational environment. The results reveal a critical limitation of our current SCBoost implementation. Compared to highly optimized frameworks like LGBM and XGB, SCBoost was consistently one to two orders of magnitude slower. For instance, on the OVA\_Uterus dataset, SCBoost took approximately 38.3 seconds, whereas LGBM and XGB finished in just 3.3 and 5.4 seconds, respectively. This significant computational overhead stems primarily from the SRP step, which performs an SVD on the ever-growing prediction history matrix $\mathbf{H}^{(t-1)}$ at each iteration. This $O(n(t-1)^{2})$ cost per iteration quickly becomes a bottleneck, especially for large $n$ or $T$.

Table 5: Wall-clock time comparison in seconds across five benchmark datasets. The values represent the total time required for training each model.

Analysis of the Bottleneck. Unlike traditional boosting algorithms (e.g., XGB, LGBM), which scale linearly with $T$ ($O(Tnd\log n)$), SCBoost exhibits a cubic dependence on the number of iterations ($T^{3}$) and a linear dependence on sample size ($n$) for the SVD operations. This theoretical overhead explains the wall-clock time gap observed in Table [5](https://arxiv.org/html/2606.17567#A5.T5). While this cost is significant, it is the price paid for the strict residual orthogonalization that eliminates redundancy. The memory complexity also grows as $O(nT)$, as the algorithm must store the full history of prediction vectors to maintain orthogonality. Consequently, the current exact implementation of SCBoost is best suited for scenarios where model compactness and diversity are prioritized over training speed, or where $n$ and $T$ are moderate.

## Appendix F Parameter Configurations

To ensure a fair and reproducible comparison of the algorithms’ intrinsic capabilities, we fixed only the random seed (random\_state=42) and n\_estimators=100, while keeping all other parameters at the official recommended values provided by the respective software packages (e.g., Scikit-learn). The proposed SCBoost algorithm was configured in the same way, following an official setup consistent with these baselines. It is well known that leading ensemble methods such as XGB and LGBM are highly sensitive to hyperparameter tuning, and their performance can often be substantially improved through data-specific optimization. Our deliberate choice to use default parameters is intended to assess the baseline robustness and practical usability of each method out of the box. This design isolates the intrinsic performance of the underlying architectures from the confounding effects of extensive, and often computationally expensive, tuning. This is an important consideration for real-world applications where such tuning may be infeasible.

For the high-dimensional OVA\_Uterus dataset, however, NGB’s official configuration (natural\_gradient=True) consistently failed with a LinAlgError. This issue, known to arise from the inversion of the Fisher Information Matrix in high-dimensional settings, persisted even with official stabilization options. To obtain a valid comparison, we disabled this mechanism by setting natural\_gradient=False, effectively reverting the model to standard gradient boosting. Results obtained under this modification are denoted with an asterisk (\*) in our tables.

To ensure a fair and rigorous comparison against the performance ceiling of each baseline model, we conducted an extensive hyperparameter search. This process was automated using the Optuna framework \[[1](https://arxiv.org/html/2606.17567#bib.bib49)\], a SOTA Bayesian optimization library. For each baseline algorithm, the optimization objective was to maximize the mean accuracy score obtained through a 10-fold stratified cross-validation procedure on each dataset, with a fixed random seed (random\_state=42) for reproducibility. The optimization for each model on each dataset was performed for 50 trials, where each trial corresponded to a unique set of hyperparameters selected by Optuna’s Tree-structured Parzen Estimator sampler. The specific hyperparameter search spaces were configured as follows. For RF, n\_estimators was searched in the integer range \[50, 200\] and max\_depth in \[3, 8\]. For GBDT, XGB, LGBM, and CAT, we searched n\_estimators in \[50, 200\], max\_depth in \[3, 8\], and learning\_rate on a log-uniform scale between $10^{-3}$ and 1.0. For ADA and NGB, we searched n\_estimators in \[50, 200\] and learning\_rate on a log-uniform scale between $10^{-3}$ and 1.0. The set of hyperparameters yielding the highest cross-validation accuracy was then used to report the final tuned performance in Table 2 of the main text. As stated in the main text, SCBoost was not subjected to this tuning process and retained its default parameters for all comparisons.
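
The search procedure described above can be sketched as follows for the GBDT-style search space. This is a hedged illustration rather than the authors' script: the helper names `suggest_gbdt_params` and `tune_gbdt`, the use of scikit-learn's `GradientBoostingClassifier` as the GBDT baseline, shuffling inside the stratified folds, and seeding the TPE sampler are assumptions introduced here.

```python
def suggest_gbdt_params(trial):
    """Search space reported for GBDT, XGB, LGBM, and CAT."""
    return {
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
        "max_depth": trial.suggest_int("max_depth", 3, 8),
        # Log-uniform scale between 1e-3 and 1.0.
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 1.0, log=True),
    }


def tune_gbdt(X, y, n_trials=50, seed=42):
    """Maximise mean 10-fold stratified CV accuracy with Optuna's TPE sampler."""
    # Imported lazily so the search-space helper has no heavy dependencies.
    import optuna
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # shuffle=True is an assumption; random_state only has an effect with it.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

    def objective(trial):
        model = GradientBoostingClassifier(
            random_state=seed, **suggest_gbdt_params(trial)
        )
        return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()

    study = optuna.create_study(
        direction="maximize",
        sampler=optuna.samplers.TPESampler(seed=seed),
    )
    study.optimize(objective, n_trials=n_trials)
    return study.best_params, study.best_value
```

Keeping the search space in its own function makes it easy to swap in the RF space (no learning rate) or the ADA and NGB space (no depth) while reusing the same cross-validation objective.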

## Appendix G Related Work

Boosting. The history of boosting algorithms is a testament to the power of iterative improvement. From the foundational AdaBoost \[[13](https://arxiv.org/html/2606.17567#bib.bib13)\] and Gradient Boosting Machines \[[14](https://arxiv.org/html/2606.17567#bib.bib14)\] to modern titans like XGBoost \[[9](https://arxiv.org/html/2606.17567#bib.bib1)\] and LightGBM \[[19](https://arxiv.org/html/2606.17567#bib.bib2)\], the dominant trajectory of innovation has been overwhelmingly focused on two axes: computational acceleration (e.g., histogram-based binning, optimized tree construction) and specialized feature handling (e.g., CatBoost’s target-based encoding) \[[25](https://arxiv.org/html/2606.17567#bib.bib15)\]. While these engineering breakthroughs have made boosting scalable to massive datasets, and recent advancements like ITBoost \[[29](https://arxiv.org/html/2606.17567#bib.bib51)\] have introduced information-theoretic approaches to enhance algorithmic robustness, they largely operate within the unchanged paradigm of sequential residual fitting. Consequently, the fundamental issue of learner redundancy has been largely sidestepped.

Ensemble diversity. The quest for ensemble diversity has yielded two main approaches. Randomization strategies \[[15](https://arxiv.org/html/2606.17567#bib.bib16)\], effective in parallel paradigms like bagging \[[3](https://arxiv.org/html/2606.17567#bib.bib17), [6](https://arxiv.org/html/2606.17567#bib.bib18)\], are merely passive heuristics in boosting, failing to counteract its sequentially induced redundancy. More active methods, like NCL \[[8](https://arxiv.org/html/2606.17567#bib.bib19), [7](https://arxiv.org/html/2606.17567#bib.bib20), [33](https://arxiv.org/html/2606.17567#bib.bib21)\] and decorrelation penalties \[[32](https://arxiv.org/html/2606.17567#bib.bib22), [16](https://arxiv.org/html/2606.17567#bib.bib23), [34](https://arxiv.org/html/2606.17567#bib.bib24)\], introduce diversity as a soft constraint within the loss function. This creates a contentious trade-off between accuracy and diversity, and can weaken individual learners or complicate convergence. Crucially, all these methods treat diversity as an external property to be encouraged, rather than an intrinsic goal of the learning target itself. They try to influence the learners, but they do not modify the problem each learner is asked to solve.

In summary, there exists a clear gap in the boosting paradigm: the lack of a principled, explicit, and direct mechanism that is architected specifically to eliminate redundancy during the sequential training process, without resorting to implicit loss-level penalties or passive randomization. The limitations of existing diversity-promoting methods when applied to boosting underscore the necessity for a novel approach that operates on a different level of abstraction. This critical gap motivates our work to fundamentally rethink how diversity is instilled in boosted ensembles.
