Double descent for least-squares interpolation on contaminated data: A simulation study
Summary
This simulation study examines the double descent phenomenon for least-squares interpolation on contaminated data in linear regression, comparing the performance of the least-squares interpolator with robust alternatives.
# Double descent for least-squares interpolation on contaminated data: A simulation study
Source: [https://arxiv.org/html/2605.21494](https://arxiv.org/html/2605.21494)
\(April 15, 2026\)
###### Abstract
Overparametrized models can exhibit excellent generalization performance, although they should be prone to overfitting according to classical statistical theory. The discovery of the “double descent”, indicating that the generalization error decreases again once a certain model complexity has been exceeded, opened a new line of research. Robust statistics considers statistical estimation on contaminated data, where violated model assumptions let data points appear as outliers w.r.t. the assumed “ideal” distribution, potentially severely distorting any classical estimator. We address the question of whether a double descent phenomenon can be observed in a linear regression setting with contaminated training data. We compare the performance of the highly non-robust least-squares interpolation estimator with several robust alternatives. It turns out that large overparametrization indeed allows for a double descent phenomenon, resulting in a very good generalization performance of the least-squares interpolator, surpassing that of the robust alternatives.
Keywords: Robust statistics, double descent, interpolating regime, contaminated data
## 1 Introduction
Classical statistical theory quantifies the generalization ability of machine learning algorithms in terms of the complexity of the considered model class, measured by the Rademacher complexity, the Vapnik-Chervonenkis dimension (vapnik71) or the pseudo-dimension (pol90) (e.g., kolt01; kolt02; kolt06; bart02b; bart05b). These results motivate the search for an optimal complexity of the model class via a bias-variance tradeoff, which discourages models that are too sparse because of their high bias, but also models that are too complex because of their high variance. As the complexity of neural network classes grows with their number of parameters (bart03, maass), generalization bounds for large neural networks are very loose and hence cannot explain their often excellent generalization ability.
In a recent line of research, starting with the empirical work belkin19 and the theoretical work belkin, the “double descent” phenomenon has been discovered, indicating that the generalization loss of a model class first grows with increasing complexity, but drops again as the model complexity increases further. While the increase in the generalization loss is referred to as overfitting, the regime of the second descent has been termed the “interpolation regime”. While the first works on this topic considered least-squares interpolation (belkin; neal; muthu20), results for overparametrized classification (muthu21; dar21) and neural network classifiers (frei; frei23; george; zhu23) followed.
Robust statistics (huber; hampel; rieder; maronna) considers the analysis of contaminated data. Such contamination arises from model misspecification, letting realizations from the unknown, real data distribution appear as outliers w.r.t. the assumed “ideal” distribution. Since classical, non-robust estimators are prone to be distorted when applied to contaminated data, leading to a bad generalization ability, robust statistics provides methods for performing estimation on contaminated data while maintaining a good generalization ability, at the price of reduced efficiency.
In this work, we purely empirically study the generalization performance of overparametrized models that have been trained on contaminated data and are evaluated on clean data. We consider the least-squares interpolator and different robust counterparts, namely regression with the Huber loss, with the Tukey loss, the sparse least trimmed squares (SLTS; alfons13) and robust Boosting (RRBoost; ju21). We further provide an idea that combines minimum $l_2$-norm interpolation with clean subset selection by first applying SLTS or RRBoost, respectively, and by training a minimum $l_2$-norm interpolator solely on the identified clean subset.
The paper is organized as follows. In Sec. [2](https://arxiv.org/html/2605.21494#S2), we list related work on the double descent phenomenon. Sec. [3](https://arxiv.org/html/2605.21494#S3) compiles basic concepts of robust statistics, overparametrized regression as well as our approaches where we first use a robust algorithm for identifying a clean subset and apply a minimum $l_2$-norm interpolator on this subset. Sec. [4](https://arxiv.org/html/2605.21494#S4) is devoted to a description of our simulation setting, the applied algorithms, $\mathsf{R}$-packages and the evaluation. The results concerning test errors, training errors, coefficients and the number of iterations are graphically presented in Sec. [5](https://arxiv.org/html/2605.21494#S5), Sec. [6](https://arxiv.org/html/2605.21494#S6), Sec. [7](https://arxiv.org/html/2605.21494#S7) and Sec. [8](https://arxiv.org/html/2605.21494#S8), respectively. We discuss the results and conclude in Sec. [9](https://arxiv.org/html/2605.21494#S9).
## 2 Related work
The experimental work belkin19, where the double descent phenomenon was discovered, has been the starting point for some theoretical works on least-squares interpolation. In belkin, the prediction risk for the least-squares estimator is computed, confirming a theoretical double descent provided a sufficiently high signal-to-noise ratio (SNR). A double descent can also occur w.r.t. the number of epochs in neural network training (nakkiran).
In some works (e.g., belkin; neal; muthu20), the test MSE for the minimum $l_2$-norm interpolator in a Gaussian setting is computed, showing that the MSE decreases for $p>n$ with growing $p$. Moreover, the test MSE can be decomposed into a noise-dependent component, which depends on the noise variance and hence vanishes on noise-free data, and a signal-dependent component which depends on the norm of the coefficient vector. Noise-dependent and signal-dependent error bounds have been computed in bartlett20; hastie22; muthu; dar21; tsigler.
It has been shown in dar21; bartlett21; tsigler that the double descent requires a low-dimensional signal, i.e., it must be aligned with the $s \ll n$ highest eigenvalues, as well as a low effective data dimension. Further, it has been shown in hastie22 that for growing $p$, the $l_2$-norm of the minimum $l_2$-norm solution decreases due to more degrees of freedom, i.e., a larger $p$ can lead to a more regularized solution, reducing the variance.
A decomposition of the entire model has been considered in bartlett21, who show that for linear regression, the minimum-norm interpolant can be decomposed into a prediction component (in the subspace of the first $s$ eigenvectors) and an interpolation component, which they describe as not being useful but also not being harmful for the prediction, more precisely, whose contribution to the test error is small provided that the eigenvalues of the data matrix decay slowly.
Overparametrized classification has been considered for example in deng22; montanari; kini; chatterji; muthu21. As for neural networks, the double descent for shallow classification networks has been studied in frei; frei23; george; karhadkar, while deep ReLU neural networks have been considered in zhu23.
As for noisy or contaminated data, liu22d point out that overfitting on contaminated data (label noise) leads to a poor generalization ability. They argue that even robust losses like the absolute error do not solve the problem, as overparametrized models would still interpolate and hence overfit the contamination component. rahimi24 experimentally study the multiple descent in unsupervised learning in the presence of Gaussian (and hence not heavy-tailed) noise and domain shift, revealing that the multiple descent occurs even in the presence of heavy noise. They use auto-encoders, and the double descent occurs both w.r.t. the layer size and w.r.t. the epoch number. park25 consider overparametrized linear regression in the presence of noise, missing data and $Y$-outliers. They generalize the risk bounds from bartlett20 to the setting of additive $X$-contamination (from a centered distribution) and cells missing completely at random, and experimentally show the double descent in this setting. singh22 express the risk for NNs in terms of influence curves. The resulting theorem (singh22, Thm. 5) characterizes the MSE in terms of $n$ and $p$, confirming the peak at $p\approx n$ and a decrease for $p>n$. vilucchio study regression with the Huber loss and mention in the outlook that one could study the interplay of contamination with benign overfitting. kausik consider noisy inputs, i.e., the case that one only has access to $X+A$ for a noise matrix $A$ that stems from a rotationally bi-invariant distribution whose eigenvalues still allow for the Marchenko-Pastur law. They show that the double descent also occurs for the minimum $l_2$-norm interpolator. tripuraneni consider covariate shifts in the form of a power shift law that can suppress the largest eigendirections. They confirm robustness against such shifts for overparametrized linear models.
## 3 Preliminaries
### 3.1 Robust regression
Let $n$ and $p$ denote the number of instances and predictors, respectively. In a linear regression setting, let $D=(X,Y)\in\mathbb{R}^{n\times(p+1)}$ be a data set where $Y\in\mathbb{R}^{n}$ is the response vector and where $X\in\mathbb{R}^{n\times p}$ is the predictor matrix. We assume the linear structure

$$Y_{i}=X_{i}\beta+\epsilon_{i} \tag{3.1}$$

for a coefficient vector $\beta\in\mathbb{R}^{p}$ and for a noise term $\epsilon_{i}$. In the least-squares setting, one assumes $\epsilon_{i}\sim\mathcal{N}(0,\sigma^{2})$ i.i.d.

The least-squares estimator is, in the classical setting where $p<n$, given by

$$\hat{\beta}^{\text{LS}}=\operatorname*{argmin}_{\beta\in\mathbb{R}^{p}}\left(\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-X_{i}\beta)^{2}\right)=(X^{T}X)^{-1}X^{T}Y. \tag{3.2}$$

This estimator is the maximum likelihood estimator in the Gaussian setting and therefore highly efficient, but also highly non-robust on contaminated data. See (rieder, Sec. 4.2) for the following definition.
###### Definition 3.1.
Let $(\Omega,\mathcal{A})$ be a measurable space and let $\mathcal{P}:=\{P_{\theta}\ |\ \theta\in\Theta\}$ be a parametric model, with the parameter space $\Theta\subset\mathbb{R}^{p}$. Each $P_{\theta}$ denotes a distribution on $(\Omega,\mathcal{A})$. Let $P_{\theta_{0}}$ be the ideal distribution. The set of all distributions of the form

$$\mathcal{U}_{c}(\theta_{0}):=\{U_{c}(\theta_{0},r)\ |\ r\in[0,\infty[\}$$

is referred to as convex contamination model, consisting of convex contamination balls

$$U_{c}(\theta_{0},r)=\{(1-r)_{+}P_{\theta_{0}}+\min(1,r)Q\ |\ Q\in\mathcal{M}_{1}(\mathcal{A})\}.$$

Here, $\mathcal{M}_{1}(\mathcal{A})$ is the set of probability distributions on $\mathcal{A}$. The probability $r$ is called the “contamination radius”.

Apart from the convex contamination model, other types of contamination models can be found in rieder. On contaminated data, the least-squares estimator may become unreliable. Robust statistics quantifies the robustness of an estimator by its breakdown point (BDP). More precisely, the finite-sample BDP (huber83) is defined as follows.
###### Definition 3.2.
Let $Z_{n}$ be a data sample consisting of $n$ instances $(X_{i},Y_{i})$, $i=1,...,n$. For an estimator $\hat{\theta}\in\Theta$, the finite-sample breakdown point is

$$\varepsilon^{*}(\hat{\theta},Z_{n})=\min\left\{\frac{m}{n}\ \bigg|\ \sup_{Z_{n}^{m}}(||\hat{\theta}(Z_{n}^{m})||)=\infty\right\}, \tag{3.3}$$

where $Z_{n}^{m}$ denotes the set that has exactly $(n-m)$ instances in common with the original sample $Z_{n}$ and where $\hat{\theta}(Z_{n}^{m})$ denotes the estimated coefficient on $Z_{n}^{m}$.

The finite-sample BDP has to be interpreted in the sense that by contaminating $m$ out of $n$ instances, one has full control over the estimator, so, in a worst-case setting, one can make the norm of the estimator larger than any bound. The least-squares estimator is an M-estimator with the loss function $\rho(r)=r^{2}$, whose derivative is $\psi(r)=2r$. It is well-known that M-estimators with monotone derivative of the loss function have a BDP of zero (e.g., maronna), including the least-squares estimator.
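As a small illustration of a breakdown point of zero, consider the intercept-only model, where the least-squares estimator reduces to the sample mean. The sketch below (with the hypothetical helper `contaminate` and made-up toy numbers) replaces a single observation by an increasingly large outlier: the mean follows the outlier without bound, whereas the median, shown only for comparison, does not move.

```python
from statistics import mean, median

def contaminate(sample, m, value):
    """Replace the first m observations by an arbitrary outlier value."""
    return [value] * m + list(sample[m:])

clean = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy sample, not data from the study

for outlier in (1e3, 1e6, 1e9):
    bad = contaminate(clean, 1, outlier)
    # The mean (least squares for a location model) grows with the outlier,
    # the median stays at the same value.
    print(outlier, mean(bad), median(bad))
```

A single contaminated instance out of $n$ suffices here, so the fraction $m/n$ in the definition can be made arbitrarily small by increasing $n$.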
Several robust linear regression estimators have been proposed in the literature. Classically, one replaces the squared loss function with a loss function whose derivative is bounded. Huber regression optimizes the Huber loss (huber64)

$$L^{\text{Huber}}_{\delta}(r)=\begin{cases}r^{2}/2, & |r|\leq\delta\\ \delta|r|-\delta^{2}/2, & |r|>\delta\end{cases},$$

whose derivative is bounded but still monotone. Its BDP is at most 25%, but can be arbitrarily close to zero, depending on the distribution of the predictors (see (hampel, Sec. 6.4)). A loss function whose derivative is even redescending, i.e., $\lim_{|r|\rightarrow\infty}(\psi(r))=0$, is the Tukey biweight loss function,

$$L^{\text{Tukey}}_{k}(r)=\begin{cases}1-[1-(r/k)^{2}]^{3}, & |r|\leq k\\ 1, & |r|>k\end{cases}.$$
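The two loss functions and their derivatives $\psi$ can be written down directly. The following is a minimal sketch; the function names are hypothetical, and the default tuning constants $\delta=1.345$ and $k=4.685$ are the values reported in Sec. 4.2.

```python
def huber_loss(r, delta=1.345):
    """Quadratic for |r| <= delta, linear beyond."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * a - 0.5 * delta * delta

def huber_psi(r, delta=1.345):
    """Derivative of the Huber loss: r clipped to [-delta, delta] (bounded, monotone)."""
    return max(-delta, min(delta, r))

def tukey_loss(r, k=4.685):
    """Tukey biweight loss, constant equal to 1 for |r| > k."""
    if abs(r) > k:
        return 1.0
    return 1.0 - (1.0 - (r / k) ** 2) ** 3

def tukey_psi(r, k=4.685):
    """Derivative of the Tukey loss: redescends to 0 for |r| > k."""
    if abs(r) > k:
        return 0.0
    u = r / k
    return 6.0 * r / (k * k) * (1.0 - u * u) ** 2

for r in (0.5, 2.0, 10.0):
    print(r, huber_psi(r), tukey_psi(r))
```

For a very large residual, `huber_psi` stays at $\delta$ while `tukey_psi` is exactly zero, which is the difference between a bounded monotone and a redescending derivative.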
Another strategy is followed by the least trimmed squares (LTS) estimator proposed in rous84. Instead of optimizing the sum of the squared residuals of all $n$ instances, rous84 propose to replace the full sum by a truncated sum, i.e.,

$$\hat{\beta}^{\text{LTS}}=\operatorname*{argmin}_{\beta\in\mathbb{R}^{p}}\left(\frac{1}{h}\sum_{i=1}^{h}r(\beta)_{i:n}^{2}\right),$$

where $r(\beta)=Y-X\beta$ and where $z_{i:n}$ denotes the $i$-th smallest element of the vector $z$. In other words, one optimizes the squared residuals over a “clean” subset of cardinality $h$. rous84 show that its BDP is $\lfloor n/2\rfloor+\lfloor(p+1)/2\rfloor$, if one selects the size of the clean subset by $h=\lfloor n/2\rfloor+1$.
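For a fixed coefficient vector, the trimmed objective is easy to evaluate. The sketch below assumes, as in the usual LTS definition, that the ordering is applied to the squared residuals; `lts_objective` is a hypothetical name and the residuals are toy values.

```python
def lts_objective(residuals, h):
    """Mean of the h smallest squared residuals."""
    squared = sorted(r * r for r in residuals)
    return sum(squared[:h]) / h

residuals = [0.1, -0.2, 0.05, 100.0, -0.15, 0.3]   # one gross outlier
n = len(residuals)
h = n // 2 + 1                                     # h = floor(n/2) + 1

print(lts_objective(residuals, h))   # outlier is trimmed away
print(lts_objective(residuals, n))   # ordinary mean squared residual
```

With $h<n$ the outlying residual never enters the sum, whereas with $h=n$ the objective coincides with the least-squares criterion and is dominated by it.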
### 3.2 The case $p>n$
For the overparametrized setting, i.e., $p>n$, due to the singularity of $X^{T}X$, the inverse in the least-squares solution in [3.2](https://arxiv.org/html/2605.21494#S3.E2) is replaced by the Moore-Penrose inverse, leading to the closed-form solution $X^{+}Y$.
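To make the closed-form solution concrete: if the rows of $X$ are linearly independent (an assumption of this sketch), then $X^{+}Y=X^{T}(XX^{T})^{-1}Y$. The pure-Python example below uses hypothetical helpers `solve` and `min_norm_interpolator` and a toy system with $n=2$, $p=3$; it is only meant to show that the resulting coefficient vector interpolates the data.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= f * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def min_norm_interpolator(X, Y):
    """Return X^T (X X^T)^{-1} Y, equal to X^+ Y when X has full row rank."""
    n, p = len(X), len(X[0])
    gram = [[sum(X[i][k] * X[j][k] for k in range(p)) for j in range(n)]
            for i in range(n)]
    a = solve(gram, Y)
    return [sum(X[i][k] * a[i] for i in range(n)) for k in range(p)]

X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]          # n = 2 instances, p = 3 predictors
Y = [1.0, 2.0]

beta = min_norm_interpolator(X, Y)
fitted = [sum(x * b for x, b in zip(row, beta)) for row in X]
print(beta, fitted)            # fitted values reproduce Y
```

Among all coefficient vectors that fit the two equations exactly, this one has the smallest $l_2$-norm. For realistic problem sizes, a numerically stable pseudo-inverse routine is preferable to this hand-written elimination.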
As for the robust counterparts that optimize the Huber or the Tukey loss, due to the lack of a closed-form representation of the solution, one usually applies the iterative reweighted least-squares algorithm (IRWLS, huber). A weighted least-squares estimator is given by

$$\hat{\beta}^{\text{WLS}}=\operatorname*{argmin}_{\beta\in\mathbb{R}^{p}}\left(\frac{1}{n}\sum_{i=1}^{n}w_{i}(Y_{i}-X_{i}\beta)^{2}\right)=(X^{T}WX)^{-1}X^{T}WY, \tag{3.4}$$

where $W$ is a diagonal matrix with the $i$-th diagonal element $w_{i}$. The weights $w_{i}$ satisfy $w_{i}\geq 0$ and $\sum_{i}w_{i}=1$. For $p>n$, the weighted least-squares solution is replaced by $(\tilde{X})^{+}Y$, where $\tilde{X}=W^{1/2}X$. Therefore, initializing the IRWLS scheme with the least-squares estimator reduces any IRWLS procedure to minimum $l_{2}$-norm interpolation, since the least-squares estimator already interpolates the data.
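One way to see the last claim, sketched here for the Huber loss under the standard IRWLS weight $w(r)=\psi(r)/r$ (an assumption, since the weight function is not spelled out above): if all residuals are zero, all weights coincide, so the reweighted problem is again the unweighted one. The function names are hypothetical.

```python
def huber_weight(r, delta=1.345):
    """IRWLS weight psi(r) / r for the Huber loss, with the limit 1 at r = 0."""
    a = abs(r)
    return 1.0 if a <= delta else delta / a

def normalized_weights(residuals, delta=1.345):
    """Weights rescaled to sum to one."""
    w = [huber_weight(r, delta) for r in residuals]
    total = sum(w)
    return [wi / total for wi in w]

# Residuals of an interpolating fit are all zero: every instance gets weight 1/n.
print(normalized_weights([0.0, 0.0, 0.0, 0.0]))

# With a large residual, the corresponding instance would be downweighted.
print(normalized_weights([0.0, 0.0, 0.0, 13.45]))
```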
As a consequence, in our simulations, it was not possible to use the implementation of the IRWLS estimators that minimize the Huber loss and the Tukey loss, respectively, provided in the $\mathsf{R}$-package robustreg. Therefore, we use the $\mathsf{R}$-package MTE (qin17, MTE) where a given loss function is optimized by gradient descent. We increased the maximum number of iterations to 100.
The LTS estimator is computed by iteratively updating the clean subset and computing the least-squares estimator on this updated set. For $p>hn$, one again faces the problem that the fit is perfect on the clean subset. Optimizing another loss function such as the Huber or Tukey loss function on the clean subsets would not be meaningful as it would correspond to a double robustification, both in the sense of replacing the squared loss with a loss function with bounded derivative and in the sense of trimming, which would at least considerably reduce the efficiency of the estimator and increase the computational costs. In addition, we observe in our experiments that at least the Huber M-estimator interpolates the training data once $p$ is sufficiently large, rendering any identification of a clean subset meaningless.
We propose to apply a robust regression technique that allows for data with $p>n$ while still optimizing the least-squares loss, and decide to consider two candidates: the sparse LTS (alfons13), provided in the $\mathsf{R}$-package robustHD (alfons16), and the robust Boosting algorithm RRBoost from ju21, provided in the $\mathsf{R}$-package RRBoost (rrboost). In order to perform minimum $l_{2}$-norm interpolation on a clean subset, we apply either the sparse LTS or the robust Boosting algorithm on the whole training set and identify the clean subset based on the squared residuals. On this clean subset, we compute the minimum $l_{2}$-norm interpolator.
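The selection step can be sketched as follows. The exact selection rule is not stated above, so this example assumes that the $h$ instances with the smallest squared residuals of the robust fit are kept; `clean_subset` and `restrict` are hypothetical helper names and the numbers are toy values.

```python
def clean_subset(residuals, h):
    """Indices of the h instances with the smallest squared residuals."""
    order = sorted(range(len(residuals)), key=lambda i: residuals[i] ** 2)
    return sorted(order[:h])

def restrict(X, Y, idx):
    """Keep only the rows of (X, Y) whose indices are in idx."""
    return [X[i] for i in idx], [Y[i] for i in idx]

# Residuals of some initial robust fit (made-up values with two outliers).
residuals = [0.1, -5.0, 0.2, 7.0, -0.05, 0.3]
idx = clean_subset(residuals, h=3)
print(idx)

X = [[float(i), 1.0] for i in range(6)]
Y = [float(i) for i in range(6)]
X_clean, Y_clean = restrict(X, Y, idx)
# X_clean, Y_clean would then be passed to the minimum l2-norm interpolator.
```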
## 4 Simulation
### 4.1 Data generation
In this paper, we solely consider linear regression. Motivated by the experiments in kobak and muthu20, we use a sparse true coefficient vector and generate the predictors according to a Gaussian distribution, either with independent features or with a spiked covariance matrix.
The predictors $X_{i}$ are i.i.d. realizations from a multivariate normal distribution $\mathcal{N}_{p}(\mu\cdot 1_{p},\Sigma)$, where $1_{p}$ denotes the vector of length $p$ that only consists of ones, where $\mu\in\mathbb{R}$ and where $\Sigma$ is some covariance matrix. In the independent case, we set $\Sigma=I_{p\times p}$, denoting the identity matrix of dimension $p\times p$. In the spiked covariance case, we set $\Sigma=I_{p\times p}+\rho 1_{p}1_{p}^{T}$. We always use $\rho=0.25$.
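A spiked covariance of this form can be sampled without building $\Sigma$ explicitly: adding a shared factor $\sqrt{\rho}\,g_{i}$ with $g_{i}\sim\mathcal{N}(0,1)$ to independent standard normal entries yields the covariance $I_{p\times p}+\rho 1_{p}1_{p}^{T}$. The sketch below is one possible construction, not necessarily the one used in the study; `sample_predictors` is a hypothetical name.

```python
import random

def sample_predictors(n, p, mu=0.0, rho=0.0, rng=None):
    """Draw n rows from N_p(mu * 1_p, I + rho * 1 1^T).

    Uses X_ij = mu + Z_ij + sqrt(rho) * g_i with independent standard
    normal Z_ij and g_i, so that Cov(X_ij, X_ik) = 1{j = k} + rho.
    """
    rng = rng or random.Random()
    shift = rho ** 0.5
    rows = []
    for _ in range(n):
        g = rng.gauss(0.0, 1.0)   # shared factor that creates the spike
        rows.append([mu + rng.gauss(0.0, 1.0) + shift * g for _ in range(p)])
    return rows

# Independent design (rho = 0) and spiked design (rho = 0.25).
X_indep = sample_predictors(n=50, p=100, rng=random.Random(1))
X_spiked = sample_predictors(n=50, p=100, rho=0.25, rng=random.Random(1))
```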
The true coefficient vector $\beta$ is generated randomly, either with Gaussian components, i.e., $\beta\sim\mathcal{N}_{p}(0_{p},I_{p\times p})$, or with uniformly distributed components, i.e., $\beta_{j}\sim U([1,2])$ i.i.d. In our simulations, in order to generate the test loss curve w.r.t. $p$, we use different $p$ but always let the true number of predictors be $s=20$. For $p>s$, $p-s$ components of $\beta$ are randomly selected and set to zero. For $p<s$, we generate a data set where $X$ has $s$ columns, generate the responses from it as well as the noise vector, and randomly select only $p$ predictor columns as input for the regression algorithm.
We first generate the responses $Y_{i}=X_{i}\beta$ and add a Gaussian noise vector $\epsilon\sim\mathcal{N}_{n}(0_{n},\sigma^{2}I_{n\times n})$ where we set $\sigma$ such that on the generated sample, a specified signal-to-noise ratio is maintained.
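The text does not give the formula behind the signal-to-noise ratio. The sketch below therefore assumes the common convention $\operatorname{SNR}=\widehat{\operatorname{Var}}(X\beta)/\sigma^{2}$, with the variance computed on the generated sample; both this convention and the function names are assumptions made for illustration.

```python
import random
from statistics import pvariance

def noise_sd_for_snr(signal, snr):
    """Noise standard deviation sigma such that Var(signal) / sigma^2 = snr."""
    return (pvariance(signal) / snr) ** 0.5

def add_noise(signal, snr, rng):
    """Add centered Gaussian noise calibrated to the requested SNR."""
    sigma = noise_sd_for_snr(signal, snr)
    return [s + rng.gauss(0.0, sigma) for s in signal]

signal = [0.0, 2.0, 4.0, 6.0]            # toy values of X_i * beta
print(noise_sd_for_snr(signal, 5.0))     # empirical variance 5, so sigma = 1
Y = add_noise(signal, 5.0, random.Random(0))
```

Under this convention, a smaller SNR such as 0.1 leads to a noise variance ten times larger than the signal variance.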
In order to generate contaminated data, we distinguish between $Y$-contamination, where only the response vector is contaminated, and $X$-contamination, where only the predictor matrix $X$ is contaminated. In both cases, the contamination is injected after having computed $Y=X\beta+\epsilon$, so the signal-to-noise ratio corresponds to the clean data. In the case of $Y$-contamination, we specify a contamination radius, $r$, and randomly select $\lfloor rn\rfloor$ components of $Y$. We consider additive contamination and add a fixed value of $c_{out}$ to each of the selected components. In the case of $X$-contamination, we randomly select $\lfloor rn\rfloor$ rows of the predictor matrix. We then add a fixed value of $c_{out}$ to $\lfloor 0.1p\rfloor$ randomly selected cells in the respective rows, where the selection of the column indices is done individually for each row.
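Both contamination schemes can be sketched in a few lines. The function names and the returned index lists are hypothetical conveniences; the fractions and the additive value $c_{out}$ follow the description above.

```python
import math
import random

def contaminate_y(Y, r, c_out, rng):
    """Add c_out to floor(r * n) randomly selected responses."""
    n = len(Y)
    idx = rng.sample(range(n), math.floor(r * n))
    out = list(Y)
    for i in idx:
        out[i] += c_out
    return out, sorted(idx)

def contaminate_x(X, r, c_out, rng, cell_frac=0.1):
    """Add c_out to floor(cell_frac * p) random cells in floor(r * n) random rows."""
    n, p = len(X), len(X[0])
    rows = rng.sample(range(n), math.floor(r * n))
    out = [list(row) for row in X]
    for i in rows:
        # Column indices are drawn separately for every contaminated row.
        for j in rng.sample(range(p), math.floor(cell_frac * p)):
            out[i][j] += c_out
    return out, sorted(rows)

rng = random.Random(0)
Y = [0.0] * 20
X = [[0.0] * 10 for _ in range(20)]
Y_cont, y_idx = contaminate_y(Y, r=0.25, c_out=100.0, rng=rng)
X_cont, x_rows = contaminate_x(X, r=0.25, c_out=100.0, rng=rng)
```

Both functions return copies, so the clean data remain available, for instance for computing the SNR before contamination.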
###### Remark 4\.1\.
Note that, although the rows are not fully contaminated, the contamination scheme is covered by case-wise $X$-contamination by limiting the number of contaminated rows. We do not need to consider cell-wise robust approaches that would be necessary in cell-wise contamination (alqallaf) where the contamination radius refers to the fraction of contaminated cells, allowing for more than 50% of the rows to be contaminated.
We always generate a data set of size $n=n_{train}+n_{test}$ and split it into a training set of size $n_{train}$ and a test set of size $n_{test}$. In this paper, we always consider clean test data, i.e., the $Y$- or $X$-contamination is injected only into the training set. Although contaminated test data have been considered, for example, in TW24 and are undeniably important when working with real data, the purpose of this simulation study is to assess whether overparametrized regression can generalize to unseen clean data, even if the training data were contaminated.
### 4.2 Algorithms
The minimum $l_{2}$-norm interpolator is computed via the formula $\hat{\beta}^{LS}=X^{+}Y$. The Moore-Penrose pseudo-inverse $X^{+}$ is computed using the function ginv from the $\mathsf{R}$-package MASS (venables).
When optimizing the Huber loss or the Tukey loss, we use the gradient descent algorithm from the $\mathsf{R}$-package MTE (qin17; MTE), more precisely, the function huber.reg in the implementation (in the version from April 9, 2023) provided in the Github repository [https://github.com/shaobo-li/MTE](https://github.com/shaobo-li/MTE). The algorithm terminates once the difference between the current and the previous coefficient vector, quantified in the $||\cdot||_{\infty}$-norm, is smaller than $10^{-4}$, or after 100 iterations at the latest. The Huber loss is already implemented in this package. We use the hyperparameter $\delta=1.345$. As for the Tukey loss, we implemented it ourselves and just apply the gradient iterations to this loss function. Here, we use $k=4.685$. These choices for $\delta$ and $k$ correspond to an asymptotic efficiency of 95% in the Gaussian setting.
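The stopping rule (at most 100 iterations, or a sup-norm change of the coefficients below $10^{-4}$) can be sketched as follows. This is not the MTE implementation: the fixed step size, the zero initialization and the $1/n$ scaling of the gradient are assumptions made for illustration, and `huber_gradient_descent` is a hypothetical name.

```python
def huber_psi(r, delta=1.345):
    """Derivative of the Huber loss."""
    return max(-delta, min(delta, r))

def huber_gradient_descent(X, Y, step=0.1, max_iter=100, tol=1e-4):
    """Minimize the mean Huber loss of the residuals by plain gradient descent.

    Returns the coefficient vector and the number of iterations performed.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for it in range(1, max_iter + 1):
        res = [y - sum(x * b for x, b in zip(row, beta)) for row, y in zip(X, Y)]
        grad = [-sum(huber_psi(r) * row[j] for r, row in zip(res, X)) / n
                for j in range(p)]
        new = [b - step * g for b, g in zip(beta, grad)]
        # Stop once the sup-norm change of the coefficients falls below tol.
        if max(abs(a - b) for a, b in zip(new, beta)) < tol:
            return new, it
        beta = new
    return beta, max_iter

X = [[1.0], [2.0], [3.0], [4.0]]
Y = [2.0, 4.0, 6.0, 8.0]             # noise-free toy data with beta = 2
beta_hat, n_iter = huber_gradient_descent(X, Y)
print(beta_hat, n_iter)
```

Returning the iteration count alongside the coefficients makes it possible to track how often the iteration cap is reached as $p$ grows.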
The sparse LTS (SLTS) is implemented in the $\mathsf{R}$-package robustHD (alfons16). We use the function sparseLTS in the default settings except for alpha, which we set to 0.5, corresponding to a clean subset of size $h=\lfloor n/2\rfloor$. We use the clean subset identified by SLTS in order to compute the minimum $l_{2}$-norm interpolator on this subset. In addition, we also report the performance of the SLTS model itself.
The robust Boosting algorithm from ju21 is implemented in the $\mathsf{R}$-package RRBoost (rrboost). We use the function Boost with default settings. Similarly as for SLTS, we use RRBoost in order to identify a clean subset, but also report the performance of RRBoost itself.
### 4.3 Evaluation
We consider several scenarios with different values for $n_{train}$, $p$, $\mu$, the signal-to-noise ratio, $r$ and the additive perturbation value $c_{out}$. Moreover, we distinguish between clean, $Y$-contaminated and $X$-contaminated training data.
As for the number of predictors, we use each $p$ from the set

$$\{5,10,20,30,40,50,60,80,100,150,200,250,300,400,500,750,1000,1250,1500,1750,2000,3000,4000,5000\}. \tag{4.1}$$

When applying the robust Boosting algorithm, we only consider values of $p$ up to $1250$ for numerical reasons.

| $p$ | $n_{train}$ | $n_{test}$ | SNR | $\mu$ | $\beta$ | Cont. | $r$ | $c_{out}$ |
|---|---|---|---|---|---|---|---|---|
| Eq. [4.1](https://arxiv.org/html/2605.21494#S4.E1) | 50 | 50 | $\{0.1,0.5,2,5\}$ | 0 | Normal/uniform | None, $X$, $Y$ | $\{0.1,0.25,0.5\}$ | 100 |

Table 1: Basic scenario specifications

Apart from the basic scenarios specified in Tab. [1](https://arxiv.org/html/2605.21494#S4.T1), where we always use both the independent design and the spiked covariance design, we also consider some alternative scenarios. For the independent design, we set $\mu=5$, so the predictors are not centered, and apply the minimum $l_{2}$-norm interpolator, Huber-based interpolation and SLTS-based interpolation. Moreover, we increase $n_{train}$ and $n_{test}$ to 200 and consider $Y$-contamination with $r\in\{0.1,0.25,0.5,0.75,0.9\}$. Here, we apply minimum $l_{2}$-norm interpolation and Huber-loss interpolation. For computational reasons, the maximum $p$ is $4000$. Finally, we consider the same scenarios but with $c_{out}=10000$ and again apply these two algorithms.
We repeat all simulations $B=500$ times for each scenario and compute the mean test MSE. We then plot the mean training MSE, the mean test MSE and the $l_{1}$-norm differences of the estimated and true coefficients, i.e., $||\hat{\beta}-\beta||_{1}$, against the number $p$ of predictors. For the iterative gradient-based Huber- and Tukey-loss minimization algorithms, we also visualize the number of iterations as a function of $p$.
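The reported quantities are simple to compute once predictions and coefficient estimates are available. A minimal sketch with hypothetical function names and toy numbers (two repetitions instead of $B=500$):

```python
def mse(y_true, y_pred):
    """Mean squared error between responses and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def l1_distance(beta_hat, beta):
    """l1-norm of the difference between estimated and true coefficients."""
    return sum(abs(a - b) for a, b in zip(beta_hat, beta))

def mean_over_repetitions(values):
    """Average a per-repetition quantity over all repetitions."""
    return sum(values) / len(values)

# Two toy repetitions of one scenario.
test_mse_per_run = [mse([1.0, 2.0], [1.5, 2.5]), mse([1.0, 2.0], [1.0, 4.0])]
mean_test_mse = mean_over_repetitions(test_mse_per_run)
print(mean_test_mse)
print(l1_distance([1.0, -1.0, 0.0], [0.0, 0.0, 0.0]))
```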
## 5 Test errors
### 5.1 Independent features, $\mu=0$
#### 5.1.1 Minimum $l_{2}$-norm interpolation
Figure 1: Test MSE of minimum $l_{2}$-norm interpolation when trained on clean training data.

In Fig. [1](https://arxiv.org/html/2605.21494#S5.F1), the test MSE curves attain their minimum at $p=s=20$ if the SNR is 2 or 5. After the peak at $p=n$, the test MSE decreases and stays nearly constant afterwards, while always being higher than the minimum at $p=s$. For an SNR smaller than 1, the test MSE first increases until $p=n$ and decreases afterwards, even attaining smaller values than for $p<n$. In the case of $X$-contamination, as visualized in Fig. [2](https://arxiv.org/html/2605.21494#S5.F2), the test MSE curves resemble those from the case of clean training data.
A completely different behavior can be observed in the case of $Y$-contamination. For $p<n$, the test MSE curve is strictly increasing, regardless of the SNR. After the peak, the curve is strictly decreasing. For larger $r$, the test MSE is higher for $p<n$ than for smaller $r$. For $p=5000$, the test MSE is comparable to the test MSE for $p=5000$ achieved when training the model on clean data. Therefore, the test MSE decrease is steeper for larger $r$.








Figure 2: Test MSE of minimum $l_{2}$-norm interpolation when trained on contaminated training data.
#### 5.1.2 Huber-loss interpolation
Figure 3: Test MSE of Huber-loss interpolation when trained on clean training data.

The main difference between the MSE curves in Fig. [1](https://arxiv.org/html/2605.21494#S5.F1) and Fig. [2](https://arxiv.org/html/2605.21494#S5.F2) and those in Fig. [3](https://arxiv.org/html/2605.21494#S5.F3) and Fig. [4](https://arxiv.org/html/2605.21494#S5.F4) is that for the minimum $l_{2}$-norm interpolator, the peak at $p=n$ is much higher. This may be explained by the singularity of the predictor matrix at $p=n$, which does not affect the gradient descent algorithm for the Huber-loss minimization to the same extent. For $X$-contamination, when optimizing the Huber loss, the peak moves to the right as $r$ increases, while the MSE values for very low and very large $p$ remain nearly unaffected.
For $Y$-contamination, the minimum of the test MSE is still attained at $p=s$ if the SNR is larger than 1 and $r=0.1$. For $r=0.25$ and $r=0.5$, the test MSE is strictly increasing until attaining the peak (either at $p=n$ or at $p=100$), and decreasing afterwards. However, the MSE values at $p=5000$ are larger in the contaminated case than in the case with clean training data, and increase with $r$. For $X$-contamination, the peak is attained at $p=100$ for $r=0.1$ and is shifted to $p=500$ for $r=0.5$. The MSE values for low $p$ and large $p$ are comparable with the MSE values for the case of clean training data.






Figure 4: Test MSE of Huber-loss interpolation when trained on contaminated training data.
#### 5.1.3 Tukey-loss interpolation
Figure 5: Test MSE of Tukey-loss interpolation when trained on clean training data.

In Fig. [5](https://arxiv.org/html/2605.21494#S5.F5) and Fig. [6](https://arxiv.org/html/2605.21494#S5.F6), one can observe a peak at $p=n$ or at $p=20$ if the SNR is larger than 1. Before the peak, the curves are strictly increasing; after the peak, they briefly decrease and then remain constant. For a low SNR, the MSE curves are nearly constant. For $X$-contamination and large $r$, one can observe an increase of the MSE curves, in particular if the SNR is high.






Figure 6: Test MSE of Tukey-loss interpolation when trained on contaminated training data.
#### 5.1.4 SLTS-based interpolation
Figure 7: Test MSE of SLTS-based interpolation when trained on clean training data.

Figure 8: Test MSE of SLTS when trained on clean training data.





Figure 9: Test MSE of SLTS-based interpolation when trained on contaminated training data.

In Fig. [7](https://arxiv.org/html/2605.21494#S5.F7), one can observe a slight decrease in the MSE curves but without any visible peak or minimum. Fig. [9](https://arxiv.org/html/2605.21494#S5.F9) reveals that $X$-contamination and $Y$-contamination with $r=0.1$ and $r=0.25$ lead to similar MSE curves. In contrast, in the case of $Y$-contamination and $r=0.5$, the MSE is much higher for low $p$ than in the other scenarios, attains a peak at $p=20$ and clearly decreases afterwards, attaining values at $p=5000$ that are comparable to those in the other scenarios.






Figure 10: Test MSE of SLTS when trained on contaminated training data.

Standard SLTS results in similar MSE curves when trained on clean data as SLTS-based interpolation, as can be observed in Fig. [8](https://arxiv.org/html/2605.21494#S5.F8). Standard SLTS leads to nearly constant MSE curves for $X$-contamination as seen in Fig. [10](https://arxiv.org/html/2605.21494#S5.F10), while for SLTS-based interpolation, the MSE is higher for low $p$ and slightly decreases before remaining nearly constant at larger $p$. A considerable difference can be observed for $r=0.5$ and $Y$-contamination. For SLTS, the MSE is always around 5000, while for SLTS-based interpolation, the curves start around 5000 as well but decrease significantly for growing $p$.
#### 5.1.5 Boosting-based interpolation
Figure 11: Test MSE of RRBoost-based interpolation when trained on clean training data.

Figure 12: Test MSE of RRBoost when trained on clean training data.





Figure 13: Test MSE of RRBoost-based interpolation when trained on contaminated training data.

The MSE curves in Fig. [11](https://arxiv.org/html/2605.21494#S5.F11) and Fig. [13](https://arxiv.org/html/2605.21494#S5.F13) first increase, attain a peak at $p=20$ and decrease afterwards. When training the model on clean data or on $X$-contaminated data, the MSE attains similar values for large $p$ as for low $p$. For $Y$-contamination with $r=0.25$ and $r=0.5$, the MSE is larger for low $p$ and significantly decreases after the peak, although it attains higher values at $p=5000$ than in the other scenarios.

Figure 14: Test MSE of RRBoost when trained on contaminated training data.

In contrast to RRBoost-based interpolation, the MSE curves in Fig. [12](https://arxiv.org/html/2605.21494#S5.F12) and Fig. [14](https://arxiv.org/html/2605.21494#S5.F14) slightly increase for growing $p$. For $Y$-contamination and $r=0.25$, the growth is steeper than for all other scenarios, while for $r=0.5$, the MSE curves remain nearly constant.
### 5.2 Spiked covariance design, $\mu=0$
#### 5.2.1 Minimum $l_2$-norm interpolation
Figure 15: Test MSE of minimum $l_2$-norm interpolation when trained on clean training data.





Figure 16: Test MSE of minimum $l_2$-norm interpolation when trained on contaminated training data.
#### 5.2.2 Huber-loss interpolation
Figure 17: Test MSE of Huber-loss interpolation when trained on clean training data.

Figure 18: Test MSE of Huber-loss interpolation when trained on contaminated training data.
#### 5.2.3 Tukey-loss interpolation
Figure 19: Test MSE of Tukey-loss interpolation when trained on clean training data.

Figure 20: Test MSE of Tukey-loss interpolation when trained on contaminated training data.
#### 5.2.4 SLTS-based interpolation
Figure 21: Test MSE of SLTS-based interpolation when trained on clean training data.

Figure 22: Test MSE of SLTS when trained on clean training data.

Figure 23: Test MSE of SLTS-based interpolation when trained on contaminated training data.

Figure 24: Test MSE of SLTS when trained on contaminated training data.
#### 5.2.5 Boosting-based interpolation
Figure 25: Test MSE of RRBoost-based interpolation when trained on clean training data.

Figure 26: Test MSE of RRBoost when trained on clean training data.

Figure 27: Test MSE of RRBoost-based interpolation when trained on contaminated training data.

Figure 28: Test MSE of RRBoost when trained on contaminated training data.

The curves in Fig. [15](https://arxiv.org/html/2605.21494#S5.F15), Fig. [16](https://arxiv.org/html/2605.21494#S5.F16), Fig. [17](https://arxiv.org/html/2605.21494#S5.F17), Fig. [18](https://arxiv.org/html/2605.21494#S5.F18), Fig. [19](https://arxiv.org/html/2605.21494#S5.F19), Fig. [20](https://arxiv.org/html/2605.21494#S5.F20), Fig. [21](https://arxiv.org/html/2605.21494#S5.F21), Fig. [23](https://arxiv.org/html/2605.21494#S5.F23), Fig. [22](https://arxiv.org/html/2605.21494#S5.F22), Fig. [25](https://arxiv.org/html/2605.21494#S5.F25), Fig. [27](https://arxiv.org/html/2605.21494#S5.F27) and Fig. [26](https://arxiv.org/html/2605.21494#S5.F26) resemble those for the independent design. The only differences are that the MSE values are higher for the spiked covariance design and that, where the MSE decreases for growing $p$, the MSE values attained at $p=5000$ are considerably higher in the contaminated settings than in the clean setting. In Fig. [24](https://arxiv.org/html/2605.21494#S5.F24) and Fig. [28](https://arxiv.org/html/2605.21494#S5.F28), the MSE curves fluctuate for $Y$-contamination and $r=0.5$, in contrast to the nearly constant curves under the independent design.
### 5.3 Independent features, $\mu=5$
#### 5.3.1 Minimum $l_2$-norm interpolation
Figure 29: Test MSE of minimum $l_2$-norm interpolation when trained on clean training data with $\mu=5$.

In contrast to the case $\mu=0$, the MSE curves for $Y$-contamination remain around the MSE values for small $p$ after the peak. In the case of $X$-contamination and clean training data, the curves resemble those from the case $\mu=0$, with the difference that the MSE values are higher, as depicted in Fig. [29](https://arxiv.org/html/2605.21494#S5.F29) and Fig. [30](https://arxiv.org/html/2605.21494#S5.F30).






Figure 30: Test MSE of minimum $l_2$-norm interpolation when trained on contaminated training data with $\mu=5$.
#### 5.3.2 Huber-loss interpolation
Figure 31: Test MSE of Huber-loss interpolation when trained on clean training data.

As one can observe in Fig. [31](https://arxiv.org/html/2605.21494#S5.F31) and Fig. [32](https://arxiv.org/html/2605.21494#S5.F32), in contrast to the case $\mu=0$, there is no clear peak in the MSE curves for $X$-contamination and clean training data. In particular, for $Y$-contamination with $r=0.1$ and $r=0.25$, the MSE increases for growing $p$, while for $\mu=0$, it decreased after attaining a peak.






Figure 32: Test MSE of Huber-loss interpolation when trained on contaminated training data.
#### 5.3.3 SLTS-based interpolation
Figure 33: Test MSE of SLTS-based interpolation when trained on clean training data.

Figure 34: Test MSE of SLTS when trained on clean training data.





Figure 35: Test MSE of SLTS-based interpolation when trained on contaminated training data.

Figure 36: Test MSE of SLTS when trained on contaminated training data.

The MSE curves depicted in Fig. [33](https://arxiv.org/html/2605.21494#S5.F33), Fig. [35](https://arxiv.org/html/2605.21494#S5.F35), Fig. [34](https://arxiv.org/html/2605.21494#S5.F34) and Fig. [36](https://arxiv.org/html/2605.21494#S5.F36) resemble those from the case $\mu=0$, except for $X$-contamination with $r=0.5$, where the curves fluctuate instead of being nearly constant (raw SLTS) or decreasing for larger $p$ (SLTS-based interpolation).
### 5.4 $n=200$
#### 5.4.1 Minimum $l_2$-norm interpolation





Figure 37: Test MSE of minimum $l_2$-norm interpolation when trained on $Y$-contaminated training data.

In contrast to the case $n=50$ in Fig. [2](https://arxiv.org/html/2605.21494#S5.F2), there is no longer a minimum at $p=s$ in the MSE curves depicted in Fig. [37](https://arxiv.org/html/2605.21494#S5.F37), but a pronounced peak at $p=n$. After the peak, the MSE strictly decreases. The MSE values attained at $p=5000$ are larger than for the case $n=50$, but the curves suggest a further decrease if $p$ grew larger.
#### 5.4.2 Huber-loss interpolation





Figure 38: Test MSE of Huber-loss interpolation when trained on $Y$-contaminated training data.

The MSE curves in Fig. [38](https://arxiv.org/html/2605.21494#S5.F38) are similar to those in Fig. [4](https://arxiv.org/html/2605.21494#S5.F4) for $r\in\{0.1,0.25\}$, while the MSE values are slightly smaller. For $r=0.5$, the MSE values are larger, but the shape of the curves is similar to that for the case $n=50$. The peak is attained shortly after $p=n$. For $r\in\{0.75,0.9\}$, the MSE is very large for small $p$, decreases slightly to a plateau, and significantly decreases once $p>n$, although the MSE values at $p=5000$ are considerably larger than for smaller contamination radii.
### 5.5 $n=200$, $c_{out}=10000$
#### 5.5.1 Minimum $l_2$-norm interpolation





Figure 39: Test MSE of minimum $l_2$-norm interpolation when trained on $Y$-contaminated training data.

The MSE curves in Fig. [39](https://arxiv.org/html/2605.21494#S5.F39) resemble those from Fig. [37](https://arxiv.org/html/2605.21494#S5.F37) for $c_{out}=100$, although the MSE values are clearly larger.
#### 5.5.2 Huber-loss interpolation





Figure 40: Test MSE of Huber-loss interpolation when trained on $Y$-contaminated training data.

For $r\in\{0.1,0.25\}$, the MSE curves in Fig. [40](https://arxiv.org/html/2605.21494#S5.F40) attain a minimum at $p=s$ and monotonically increase thereafter. In contrast to the case $c_{out}=100$ visualized in Fig. [38](https://arxiv.org/html/2605.21494#S5.F38), the curves do not decrease even for the largest values of $p$ considered. For $r\geq 0.5$, the MSE monotonically increases, attains a peak at $p=2000$ and appears to decrease as $p$ grows further.
## 6 Training errors
### 6.1 Independent features, $\mu=0$
#### 6.1.1 Minimum $l_2$-norm interpolation
Figure 41: Training MSE of minimum $l_2$-norm interpolation when trained on clean training data.

By interpolation, the training error vanishes once $p>n$, as depicted in Fig. [41](https://arxiv.org/html/2605.21494#S6.F41) and Fig. [42](https://arxiv.org/html/2605.21494#S6.F42). As expected, the training error is higher at low $p$ for $Y$-contaminated data than for $X$-contaminated and clean data.
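The vanishing training error for $p>n$ can be illustrated with a minimal sketch. It assumes independent standard Gaussian features and placeholder Gaussian responses (not the data model of this study) and computes the minimum $l_2$-norm least-squares solution through the Moore-Penrose pseudoinverse; the helper names `min_norm_fit` and `training_mse` are hypothetical.

```python
import numpy as np

def min_norm_fit(X, y):
    """Minimum l2-norm least-squares solution via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(X) @ y

def training_mse(X, y, beta_hat):
    """Mean squared residual on the training data."""
    return float(np.mean((y - X @ beta_hat) ** 2))

rng = np.random.default_rng(0)    # illustrative seed
n = 50                            # training sample size
y = rng.normal(size=n)            # placeholder responses (assumption)

for p in (10, 25, 100, 500):
    X = rng.normal(size=(n, p))   # independent Gaussian features (assumption)
    beta_hat = min_norm_fit(X, y)
    print(p, training_mse(X, y, beta_hat))
```

For $p<n$ the printed training MSE is positive, while for $p>n$ it is zero up to floating-point error, since a generic $n\times p$ design with $p>n$ has full row rank and the linear system can be solved exactly.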






Figure 42: Training MSE of minimum $l_2$-norm interpolation when trained on contaminated training data.
#### 6.1.2 Huber-loss interpolation
Figure 43: Training MSE of Huber-loss interpolation when trained on clean training data.

The training error curves in Fig. [43](https://arxiv.org/html/2605.21494#S6.F43) and Fig. [44](https://arxiv.org/html/2605.21494#S6.F44) show that the training error does not vanish directly at $p=n$ but at some higher $p$, depending on the SNR and the contamination. In particular, $X$-contamination and a high $r$ delay the vanishing of the training error, which can require up to $p=1250$. For small $p$, however, the training error is similar to that of minimum $l_2$-norm interpolation.






Figure 44: Training MSE of Huber-loss interpolation when trained on contaminated training data.
#### 6.1.3 Tukey-loss interpolation
Figure 45: Training MSE of Tukey-loss interpolation when trained on clean training data.

The training MSE for Tukey-loss interpolation, as depicted in Fig. [45](https://arxiv.org/html/2605.21494#S6.F45) and Fig. [46](https://arxiv.org/html/2605.21494#S6.F46), remains nearly constant for $Y$-contaminated and clean training data. It is not surprising that the training error does not vanish: the Tukey loss is a redescending loss, so the data cannot be interpolated unless all residuals are smaller than the threshold in absolute value. For $X$-contamination with $r=0.25$ and $r=0.5$, the training error even increases for growing $p$.
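The redescending behavior can be made concrete with a short sketch of Tukey's biweight loss and its derivative. The threshold `c = 4.685` is the common default tuning constant and is an assumption here, as the value used in the simulations is not restated in this section; the function names are illustrative.

```python
def tukey_rho(r, c=4.685):
    """Tukey biweight loss: bounded, constant for |r| > c."""
    if abs(r) <= c:
        return (c ** 2 / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    return c ** 2 / 6.0

def tukey_psi(r, c=4.685):
    """Derivative of the loss: redescends to exactly 0 for |r| > c."""
    if abs(r) <= c:
        return r * (1.0 - (r / c) ** 2) ** 2
    return 0.0

# Residuals beyond the threshold contribute a constant loss and zero gradient.
for r in (0.5, 2.0, 4.0, 10.0, 100.0):
    print(r, tukey_rho(r), tukey_psi(r))
```

Since the derivative is zero beyond the threshold, an observation with a large residual exerts no pull on the fit, so an optimizer has no incentive to reduce that residual further, which is consistent with the non-vanishing training error described above.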






Figure 46: Training MSE of Tukey-loss interpolation when trained on contaminated training data.
#### 6.1.4 SLTS-based interpolation
Figure 47: Training MSE of SLTS-based interpolation when trained on clean training data.

Figure 48: Training MSE of SLTS when trained on clean training data.





Figure 49: Training MSE of SLTS-based interpolation when trained on contaminated training data.

Figure 50: Training MSE of SLTS when trained on contaminated training data.

Neither SLTS-based interpolation nor SLTS leads to a vanishing training loss, since the model is trained only on a clean subset. For clean training data, Fig. [47](https://arxiv.org/html/2605.21494#S6.F47) shows a slight decrease of the training error for SLTS-based interpolation as $p$ increases, while the training MSE for SLTS increases after $p=s$, as visualized in Fig. [48](https://arxiv.org/html/2605.21494#S6.F48). For contaminated data, as seen in Fig. [49](https://arxiv.org/html/2605.21494#S6.F49) and Fig. [50](https://arxiv.org/html/2605.21494#S6.F50), the training error decreases for growing $p$ under $X$-contamination, while it remains nearly constant for $Y$-contamination with $r=0.1$ and $r=0.25$, except for SLTS-based interpolation at an SNR of 0.1, where a slight decrease can be observed. For SLTS-based interpolation and $Y$-contamination with $r=0.5$, one can observe a slight peak at $p=20$ and a considerable descent of the training error afterwards, while the training error remains constantly high for raw SLTS.
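Why fitting on a subset leaves the full training MSE positive can be seen from a small sketch of a trimmed squared-error criterion. It is a simplification under stated assumptions: only the trimming step is shown, the sparsity penalty of SLTS is omitted, and the residual values and the subset size `h` are made up for illustration.

```python
def trimmed_squared_error(residuals, h):
    """Sum of the h smallest squared residuals (trimming step only)."""
    squared = sorted(r * r for r in residuals)
    return sum(squared[:h])

def mean_squared_error(residuals):
    """Ordinary MSE over all observations, including the trimmed ones."""
    return sum(r * r for r in residuals) / len(residuals)

# Hypothetical residuals: four well-fitted points and two gross outliers.
residuals = [0.1, -0.2, 0.05, 30.0, -0.15, 25.0]

print(trimmed_squared_error(residuals, h=4))  # ignores the two outliers
print(mean_squared_error(residuals))          # dominated by the two outliers
```

The trimmed criterion can be small even though the MSE over all training points stays large, because the excluded observations are never fitted.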
#### 6.1.5 Boosting-based interpolation
Figure 51: Training MSE of RRBoost-based interpolation when trained on clean training data.

Figure 52: Training MSE of RRBoost when trained on clean training data.





Figure 53: Training MSE of RRBoost-based interpolation when trained on contaminated training data.

Figure 54: Training MSE of RRBoost when trained on contaminated training data.

For RRBoost-based interpolation, Fig. [51](https://arxiv.org/html/2605.21494#S6.F51) reveals that on clean data, the training error attains a peak at $p=20$ and decreases afterwards for an SNR of 0.1 and 0.5, while for an SNR of 2 and 5, it slightly increases for growing $p$. The training MSE for RRBoost itself attains a minimum at $p=s$, slightly increases afterwards and remains constant for large $p$, as shown in Fig. [52](https://arxiv.org/html/2605.21494#S6.F52). For RRBoost-based interpolation and $Y$-contamination, Fig. [53](https://arxiv.org/html/2605.21494#S6.F53) shows a pronounced peak at $p=20$ and a decrease of the training error afterwards. For $X$-contamination, the training error monotonically decreases after attaining a peak at $p=20$. For RRBoost, as visualized in Fig. [54](https://arxiv.org/html/2605.21494#S6.F54), one can observe a slight, nearly monotonic decrease of the training error for growing $p$, which is steeper for $Y$-contamination than for $X$-contamination and clean data.
### 6.2 Spiked covariance design, $\mu=0$
#### 6.2.1 Minimum $l_2$-norm interpolation
Figure 55: Training MSE of minimum $l_2$-norm interpolation when trained on clean training data.





Figure 56: Training MSE of minimum $l_2$-norm interpolation when trained on contaminated training data.
#### 6.2.2 Huber-loss interpolation
Figure 57: Training MSE of Huber-loss interpolation when trained on clean training data.

Figure 58: Training MSE of Huber-loss interpolation when trained on contaminated training data.
#### 6.2.3 Tukey-loss interpolation
Figure 59: Training MSE of Tukey-loss interpolation when trained on clean training data.

Figure 60: Training MSE of Tukey-loss interpolation when trained on contaminated training data.
#### 6.2.4 SLTS-based interpolation
Figure 61: Training MSE of SLTS-based interpolation when trained on clean training data.

Figure 62: Training MSE of SLTS when trained on clean training data.

Figure 63: Training MSE of SLTS-based interpolation when trained on contaminated training data.

Figure 64: Training MSE of SLTS when trained on contaminated training data.
#### 6.2.5 Boosting-based interpolation
Figure 65: Training MSE of RRBoost-based interpolation when trained on clean training data.

Figure 66: Training MSE of RRBoost when trained on clean training data.

Figure 67: Training MSE of RRBoost-based interpolation when trained on contaminated training data.





Figure 68: Training MSE of RRBoost when trained on contaminated training data.

It can be observed in Fig. [55](https://arxiv.org/html/2605.21494#S6.F55), Fig. [56](https://arxiv.org/html/2605.21494#S6.F56), Fig. [57](https://arxiv.org/html/2605.21494#S6.F57), Fig. [58](https://arxiv.org/html/2605.21494#S6.F58), Fig. [59](https://arxiv.org/html/2605.21494#S6.F59), Fig. [60](https://arxiv.org/html/2605.21494#S6.F60), Fig. [61](https://arxiv.org/html/2605.21494#S6.F61), Fig. [63](https://arxiv.org/html/2605.21494#S6.F63), Fig. [62](https://arxiv.org/html/2605.21494#S6.F62), Fig. [64](https://arxiv.org/html/2605.21494#S6.F64), Fig. [65](https://arxiv.org/html/2605.21494#S6.F65), Fig. [67](https://arxiv.org/html/2605.21494#S6.F67), Fig. [66](https://arxiv.org/html/2605.21494#S6.F66) and Fig. [68](https://arxiv.org/html/2605.21494#S6.F68) that the training MSE curves resemble those from the case of independent design, although the MSE values themselves are higher.
### 6.3 Independent features, $\mu=5$
#### 6.3.1 Minimum $l_2$-norm interpolation
Figure 69: Training MSE of minimum $l_2$-norm interpolation when trained on clean training data.





Figure 70: Training MSE of minimum $l_2$-norm interpolation when trained on contaminated training data.
#### 6.3.2 Huber-loss interpolation
Figure 71: Training MSE of Huber-loss interpolation when trained on clean training data.

Figure 72: Training MSE of Huber-loss interpolation when trained on contaminated training data.
#### 6.3.3 SLTS-based interpolation
Figure 73: Training MSE of SLTS-based interpolation when trained on clean training data.

Figure 74: Training MSE of SLTS when trained on clean training data.

Figure 75: Training MSE of SLTS-based interpolation when trained on contaminated training data.

Figure 76: Training MSE of SLTS when trained on contaminated training data.

It can be observed in Fig. [69](https://arxiv.org/html/2605.21494#S6.F69), Fig. [70](https://arxiv.org/html/2605.21494#S6.F70), Fig. [71](https://arxiv.org/html/2605.21494#S6.F71), Fig. [72](https://arxiv.org/html/2605.21494#S6.F72), Fig. [73](https://arxiv.org/html/2605.21494#S6.F73), Fig. [75](https://arxiv.org/html/2605.21494#S6.F75), Fig. [74](https://arxiv.org/html/2605.21494#S6.F74) and Fig. [76](https://arxiv.org/html/2605.21494#S6.F76) that the training MSE curves generally resemble those from the case $\mu=0$. For $X$-contamination with a low SNR and for $Y$-contamination with $r=0.5$, the MSE remains nearly constant after the peak, in contrast to the decreasing behavior in the setting $\mu=0$. In addition, the training error for Huber-loss interpolation vanishes much later than in the case $\mu=0$. For SLTS-based interpolation, the training MSE does not decrease with $p$ for $Y$-contamination and $r=0.5$ in Fig. [75](https://arxiv.org/html/2605.21494#S6.F75), in contrast to the case $\mu=0$, see Fig. [63](https://arxiv.org/html/2605.21494#S6.F63). Moreover, the training MSE is larger for uniformly distributed coefficients than for Gaussian coefficients in the case of $X$-contamination for $\mu=5$, which is not the case for $\mu=0$.
### 6.4 $n=200$
#### 6.4.1 Minimum $l_2$-norm interpolation





Figure 77: Training MSE of minimum $l_2$-norm interpolation when trained on $Y$-contaminated training data.

The training MSE vanishes at $p=n$, as expected (that the MSE seems to vanish earlier in Fig. [77](https://arxiv.org/html/2605.21494#S6.F77) is an artifact of the plot function in $\mathsf{R}$).
#### 6.4.2 Huber-loss interpolation





Figure 78: Training MSE of Huber-loss interpolation when trained on $Y$-contaminated training data.

It has already been observed in Fig. [57](https://arxiv.org/html/2605.21494#S6.F57) that for Huber-loss interpolation, the training MSE vanishes at some $p>n$. This is again the case in Fig. [78](https://arxiv.org/html/2605.21494#S6.F78), where the training error vanishes at around $p=500$.
### 6.5 $n=200$, $c_{out}=10000$
#### 6.5.1 Minimum $l_2$-norm interpolation





Figure 79: Training MSE of minimum $l_2$-norm interpolation when trained on $Y$-contaminated training data.

The training MSE curves in Fig. [79](https://arxiv.org/html/2605.21494#S6.F79) resemble those from Fig. [77](https://arxiv.org/html/2605.21494#S6.F77) for the case $c_{out}=100$, although the MSE values for $p<n$ are clearly larger.
#### 6.5.2 Huber-loss interpolation





Figure 80: Training MSE of Huber-loss interpolation when trained on $Y$-contaminated training data.

The training MSE curves in Fig. [80](https://arxiv.org/html/2605.21494#S6.F80) resemble those from Fig. [78](https://arxiv.org/html/2605.21494#S6.F78) for the case $c_{out}=100$, but their considerable decrease starts much later.
## 7 $l_1$-norm coefficient differences
### 7.1 Independent features, $\mu=0$
#### 7.1.1 Minimum $l_2$-norm interpolation
Figure 81: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of minimum $l_2$-norm interpolation when trained on clean training data and the true coefficient vector $\beta$.





Figure 82: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of minimum $l_2$-norm interpolation when trained on contaminated training data and the true coefficient vector $\beta$.

One can observe in Fig. [81](https://arxiv.org/html/2605.21494#S7.F81) and Fig. [82](https://arxiv.org/html/2605.21494#S7.F82) that the $l_1$-norm coefficient differences attain a peak at $p=n$ and monotonically decrease for larger $p$. For high SNRs on clean data and $X$-contaminated data, one can also observe a local minimum at $p=20$ and $p\in\{30,40\}$, respectively.
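The quantity plotted in these figures, $\|\hat{\beta}-\beta\|_1/n$, can be computed as in the following minimal sketch. It assumes that the estimated and the true coefficient vectors have the same length; the function name and the example vectors are hypothetical and not taken from the simulations.

```python
def l1_coefficient_difference(beta_hat, beta, n):
    """Return ||beta_hat - beta||_1 / n, with n the training sample size."""
    if len(beta_hat) != len(beta):
        raise ValueError("coefficient vectors must have the same length")
    return sum(abs(b_hat - b) for b_hat, b in zip(beta_hat, beta)) / n

beta = [1.0, -2.0, 0.0, 0.0]        # hypothetical true coefficients
beta_hat = [0.8, -1.5, 0.1, -0.2]   # hypothetical estimate
print(l1_coefficient_difference(beta_hat, beta, n=50))  # approximately 0.02
```

Note that the normalization uses the sample size $n$ and not the dimension $p$, so the measure can grow with $p$ even if each individual coefficient error stays small.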
#### 7.1.2 Huber-loss interpolation
Figure 83: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of Huber-loss interpolation when trained on clean training data and the true coefficient vector $\beta$.





Figure 84: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of Huber-loss interpolation when trained on contaminated training data and the true coefficient vector $\beta$.

The curves in Fig. [83](https://arxiv.org/html/2605.21494#S7.F83) and Fig. [84](https://arxiv.org/html/2605.21494#S7.F84) show that on clean data or contaminated data with a low contamination radius, the differences first grow and then monotonically decrease as $p$ grows further. For larger contamination radii, the curves are nearly constant until they decrease monotonically.
#### 7.1.3 Tukey-loss interpolation
Figure 85: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of Tukey-loss interpolation when trained on clean training data and the true coefficient vector $\beta$.





Figure 86: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of Tukey-loss interpolation when trained on contaminated training data and the true coefficient vector $\beta$.

Fig. [85](https://arxiv.org/html/2605.21494#S7.F85) and Fig. [86](https://arxiv.org/html/2605.21494#S7.F86) reveal that regardless of the contamination, the curves are nearly constant until $p=20$ and monotonically decrease as $p$ grows further.
#### 7.1.4 SLTS-based interpolation
Figure 87: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of SLTS-based interpolation when trained on clean training data and the true coefficient vector $\beta$.

Figure 88: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of SLTS when trained on clean training data and the true coefficient vector $\beta$.





Figure 89: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of SLTS-based interpolation when trained on contaminated training data and the true coefficient vector $\beta$.

Figure 90: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of SLTS when trained on contaminated training data and the true coefficient vector $\beta$.

Fig. [87](https://arxiv.org/html/2605.21494#S7.F87) and Fig. [88](https://arxiv.org/html/2605.21494#S7.F88) show a nearly monotonically decreasing behavior of the coefficient differences on clean data. For contaminated data, Fig. [89](https://arxiv.org/html/2605.21494#S7.F89) and Fig. [90](https://arxiv.org/html/2605.21494#S7.F90) reveal a similar structure as on clean data, both for SLTS and SLTS-based interpolation, except for $Y$-contamination and $r=0.5$, where the differences for SLTS-based interpolation first grow for high SNRs and monotonically decrease afterwards.
#### 7.1.5 Boosting-based interpolation
Figure 91: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of RRBoost-based interpolation when trained on clean training data and the true coefficient vector $\beta$.





Figure 92: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of RRBoost-based interpolation when trained on contaminated training data and the true coefficient vector $\beta$.

Fig. [91](https://arxiv.org/html/2605.21494#S7.F91) and Fig. [92](https://arxiv.org/html/2605.21494#S7.F92) show that the $l_1$-differences attain a peak at $p=20$ and monotonically decrease as $p$ grows further.
### 7.2 Spiked covariance design, $\mu=0$
#### 7.2.1 Minimum $l_2$-norm interpolation
Figure 93: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of minimum $l_2$-norm interpolation when trained on clean training data and the true coefficient vector $\beta$.





Figure 94: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of minimum $l_2$-norm interpolation when trained on contaminated training data and the true coefficient vector $\beta$.
#### 7.2.2 Huber-loss interpolation
Figure 95: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of Huber-loss interpolation when trained on clean training data and the true coefficient vector $\beta$.

Figure 96: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of Huber-loss interpolation when trained on contaminated training data and the true coefficient vector $\beta$.
#### 7.2.3 Tukey-loss interpolation
Figure 97: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of Tukey-loss interpolation when trained on clean training data and the true coefficient vector $\beta$.

Figure 98: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of Tukey-loss interpolation when trained on contaminated training data and the true coefficient vector $\beta$.
#### 7.2.4 SLTS-based interpolation
Figure 99: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of SLTS-based interpolation when trained on clean training data and the true coefficient vector $\beta$.

Figure 100: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of SLTS when trained on clean training data and the true coefficient vector $\beta$.

Figure 101: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of SLTS-based interpolation when trained on contaminated training data and the true coefficient vector $\beta$.

Figure 102: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of SLTS when trained on contaminated training data and the true coefficient vector $\beta$.
#### 7.2.5 Boosting-based interpolation
Figure 103: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of RRBoost-based interpolation when trained on clean training data and the true coefficient vector $\beta$.

Figure 104: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of RRBoost-based interpolation when trained on contaminated training data and the true coefficient vector $\beta$.

As one can observe in Fig. [93](https://arxiv.org/html/2605.21494#S7.F93), Fig. [94](https://arxiv.org/html/2605.21494#S7.F94), Fig. [95](https://arxiv.org/html/2605.21494#S7.F95), Fig. [96](https://arxiv.org/html/2605.21494#S7.F96), Fig. [97](https://arxiv.org/html/2605.21494#S7.F97), Fig. [98](https://arxiv.org/html/2605.21494#S7.F98), Fig. [99](https://arxiv.org/html/2605.21494#S7.F99), Fig. [101](https://arxiv.org/html/2605.21494#S7.F101), Fig. [100](https://arxiv.org/html/2605.21494#S7.F100), Fig. [102](https://arxiv.org/html/2605.21494#S7.F102), Fig. [103](https://arxiv.org/html/2605.21494#S7.F103) and Fig. [104](https://arxiv.org/html/2605.21494#S7.F104), the norm difference curves resemble those for the independent design, with slightly higher values.
### 7.3 Independent features, $\mu=5$
#### 7.3.1 Minimum $l_2$-norm interpolation
Figure 105: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of minimum $l_2$-norm interpolation when trained on clean training data and the true coefficient vector $\beta$.





Figure 106: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of minimum $l_2$-norm interpolation when trained on contaminated training data and the true coefficient vector $\beta$.
#### 7.3.2 Huber-loss interpolation
Figure 107: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of Huber-loss interpolation when trained on clean training data and the true coefficient vector $\beta$.

Figure 108: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of Huber-loss interpolation when trained on contaminated training data and the true coefficient vector $\beta$.
#### 7.3.3 SLTS-based interpolation
Figure 109: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of SLTS-based interpolation when trained on clean training data and the true coefficient vector $\beta$.

Figure 110: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of SLTS when trained on clean training data and the true coefficient vector $\beta$.

Figure 111: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of SLTS-based interpolation when trained on contaminated training data and the true coefficient vector $\beta$.

Figure 112: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of SLTS when trained on contaminated training data and the true coefficient vector $\beta$.
#### 7.3.4 Boosting-based interpolation
Figure 113: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of RRBoost-based interpolation when trained on clean training data and the true coefficient vector $\beta$.

Figure 114: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of RRBoost-based interpolation when trained on contaminated training data and the true coefficient vector $\beta$.

Fig. [105](https://arxiv.org/html/2605.21494#S7.F105), Fig. [106](https://arxiv.org/html/2605.21494#S7.F106), Fig. [107](https://arxiv.org/html/2605.21494#S7.F107), Fig. [108](https://arxiv.org/html/2605.21494#S7.F108), Fig. [109](https://arxiv.org/html/2605.21494#S7.F109), Fig. [111](https://arxiv.org/html/2605.21494#S7.F111), Fig. [110](https://arxiv.org/html/2605.21494#S7.F110), Fig. [112](https://arxiv.org/html/2605.21494#S7.F112), Fig. [113](https://arxiv.org/html/2605.21494#S7.F113) and Fig. [114](https://arxiv.org/html/2605.21494#S7.F114) reveal that the coefficient difference curves resemble those from the case $\mu=0$.
### 7.4 $n=200$
#### 7.4.1 Minimum $l_2$-norm interpolation





Figure 115: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of minimum $l_2$-norm interpolation when trained on $Y$-contaminated training data.
#### 7.4.2 Huber-loss interpolation





Figure 116: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of Huber-loss interpolation when trained on $Y$-contaminated training data.

The curves in Fig. [115](https://arxiv.org/html/2605.21494#S7.F115) and Fig. [116](https://arxiv.org/html/2605.21494#S7.F116) resemble those from the case $n=50$, although the peak is, as expected, attained at a larger $p$.

### 7.5 $n=200$, $c_{out}=10000$

#### 7.5.1 Minimum $l_2$-norm interpolation





Figure 117: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of minimum $l_2$-norm interpolation when trained on $Y$-contaminated training data.

The curves in Fig. [117](https://arxiv.org/html/2605.21494#S7.F117) resemble those from the case $c_{out}=100$ in Fig. [115](https://arxiv.org/html/2605.21494#S7.F115), but with much larger values.

#### 7.5.2 Huber-loss interpolation





Figure 118: Differences $\|\hat{\beta}-\beta\|_1/n$ for the estimated coefficient vector $\hat{\beta}$ of Huber-loss interpolation when trained on $Y$-contaminated training data.

For $r \in \{0.1, 0.25\}$, the curves in Fig. [118](https://arxiv.org/html/2605.21494#S7.F118) first decrease, as in Fig. [116](https://arxiv.org/html/2605.21494#S7.F116) for the case $c_{out}=100$, but then stay constant for large $p$ instead of decreasing further. For larger $r$, the curves decrease at larger $p$.

## 8 Number of iterations

### 8.1 Independent features, $\mu=0$

Figure 119: Mean number of iterations of Huber-loss interpolation when trained on clean training data.

Figure 120: Mean number of iterations of Tukey-loss interpolation when trained on clean training data.





Figure 121: Mean number of iterations of Huber-loss interpolation when trained on contaminated training data.

For Huber-loss interpolation, as one can observe in Fig. [123](https://arxiv.org/html/2605.21494#S8.F123) and Fig. [124](https://arxiv.org/html/2605.21494#S8.F124), the number of iterations reaches the allowed maximum of 100 if $p$ is in the vicinity of $n$. This plateau is wider for higher $r$. For $Y$-contamination, the number of iterations decreases considerably for growing $p$, eventually falling below 10 at $p=5000$. For $X$-contamination, the number of iterations only decreases for $r=0.1$, while for larger contamination radii it remains at the maximum of 100.






Figure 122: Mean number of iterations of Tukey-loss interpolation when trained on contaminated training data.

For Tukey-loss interpolation, Fig. [120](https://arxiv.org/html/2605.21494#S8.F120) reveals that the number of iterations decreases after a peak, but for an SNR of $0.1$ it increases again for large $p$. This behaviour can also be observed for $Y$-contamination, as visualized in Fig. [122](https://arxiv.org/html/2605.21494#S8.F122). For $X$-contamination, the number of iterations increases and stays close to the maximum of 100.

### 8.2 Independent features, $\mu=5$
Figure 123:Mean number of iterations of Huber\-loss interpolation when trained on clean training data\.





Figure 124: Mean number of iterations of Huber-loss interpolation when trained on contaminated training data.

In contrast to the case $\mu=0$, the number of iterations stays on the plateau much longer before decreasing for large $p$, as shown in Fig. [123](https://arxiv.org/html/2605.21494#S8.F123) and Fig. [124](https://arxiv.org/html/2605.21494#S8.F124). For $r=0.75$, however, it increases again as $p$ grows larger. For $r=0.9$, the number of iterations stays at its maximum even for very large $p$.

### 8.3 Spiked covariance design, $\mu=0$

Figure 125: Mean number of iterations of Huber-loss interpolation when trained on clean training data.

Figure 126: Mean number of iterations of Tukey-loss interpolation when trained on clean training data.





Figure 127:Mean number of iterations of Huber\-loss interpolation when trained on contaminated training data\.





Figure 128: Mean number of iterations of Tukey-loss interpolation when trained on contaminated training data.

The curves depicted in Fig. [125](https://arxiv.org/html/2605.21494#S8.F125), Fig. [126](https://arxiv.org/html/2605.21494#S8.F126), Fig. [127](https://arxiv.org/html/2605.21494#S8.F127) and Fig. [128](https://arxiv.org/html/2605.21494#S8.F128) resemble those from the independent design.

### 8.4 $n=200$

#### 8.4.1 Huber-loss interpolation





Figure 129: Mean number of iterations of Huber-loss interpolation when trained on $Y$-contaminated training data.

### 8.5 $n=200$, $c_{out}=10000$

#### 8.5.1 Huber-loss interpolation





Figure 130: Mean number of iterations of Huber-loss interpolation when trained on $Y$-contaminated training data.

## 9 Discussion and conclusion

### 9.1 Discussion of the results
The evaluation of the test MSEs in Sec. [5](https://arxiv.org/html/2605.21494#S5) reveals that the minimum $l_2$-norm interpolator indeed shows the double descent behavior: the MSE drops after the peak at $p=n$, provided that the SNR is sufficiently high (at least 2 in the experiments). Although, as expected, the MSE of models trained on contaminated data is higher than that of models trained on clean data, the MSE attains smaller values for large $p$ than for low $p$, indicating that the interpolating regime has been reached. One should note that on $Y$-contaminated data there is no double descent in the strict sense, since there is no descent before the peak but only a descent after it. $X$-contamination, in contrast, seems to affect the performance of the minimum $l_2$-norm interpolator only marginally. Huber-loss interpolation also leads to a double descent on clean data as well as on $X$-contaminated and $Y$-contaminated data with small contamination radius, provided a sufficiently large SNR, but the minimum $l_2$-norm interpolator surpasses it for large $p$. A similar behavior can be observed for RRBoosting-based interpolation. In contrast, Tukey-loss interpolation and SLTS-based interpolation as well as SLTS and RRBoost do not allow for the double descent phenomenon in our experiments, and even for $Y$-contaminated or clean data, the MSE does not necessarily decrease.
The behavior is nearly unaffected by the underlying covariance structure, i.e., whether the predictors are independent or distributed according to a spiked covariance scheme. However, the performance of Huber-loss interpolation degrades once the predictors are not centered and once the contamination magnitude $c_{out}$ becomes large, while the shape of the generalization error of minimum $l_2$-norm interpolation remains unaffected.
As for the correspondence of the training (Sec. [6](https://arxiv.org/html/2605.21494#S6)) and test errors, one can observe that for Huber-loss interpolation, the (second) descent starts roughly once the training error vanishes. This could be explained by the fact that for small absolute residuals, the Huber loss equals the squared loss, so that Huber-loss interpolation coincides with minimum $l_2$-norm interpolation in this case. For $\mu=5$, i.e., non-centered predictors, one can only observe that the test MSE no longer increases once the training MSE vanishes, but interpolation does not correspond to a decrease of the test MSE here.
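The argument about small residuals can be checked numerically. The following is a minimal sketch and not the implementation used in the study: the threshold `delta=1.345` is a commonly used default that we assume here for illustration, and both losses are written with the factor $1/2$ so that they agree exactly on the quadratic part.

```python
import numpy as np

def huber_loss(residuals, delta=1.345):
    """Huber loss: quadratic for |r| <= delta, linear beyond.

    delta=1.345 is an assumed default, not a value taken from the study.
    """
    r = np.asarray(residuals, dtype=float)
    abs_r = np.abs(r)
    quadratic = 0.5 * r**2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

def squared_loss(residuals):
    """Squared loss with the same 1/2 scaling as the Huber loss above."""
    r = np.asarray(residuals, dtype=float)
    return 0.5 * r**2

# Residuals below the threshold: both losses coincide.
small = np.array([-1.0, -0.1, 0.0, 0.5, 1.3])
same_on_small = np.allclose(huber_loss(small), squared_loss(small))

# Residuals above the threshold: the Huber loss grows only linearly.
large = np.array([-10.0, 5.0])
smaller_on_large = bool(np.all(huber_loss(large) < squared_loss(large)))

print(same_on_small, smaller_on_large)
```

Once all training residuals lie below the threshold, the two objectives are identical on the training data, which is consistent with the explanation given above.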
There seems to be no correspondence between the generalization errors and the shape of the curves depicting the differences $\|\beta-\hat{\beta}\|_1$ (Sec. [7](https://arxiv.org/html/2605.21494#S7)), as the latter always decrease for growing $p$. We also evaluated the differences $\|\beta-\hat{\beta}\|_2$ and $\|\beta-\hat{\beta}\|_\infty$, but their shape is similar.
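For reference, the three coefficient differences mentioned above can be computed as in the following sketch. The function name `coefficient_differences` and the example vectors are hypothetical and only serve as illustration; any normalization (such as the division by $n$ used in the figure captions) is left out.

```python
import numpy as np

def coefficient_differences(beta_hat, beta):
    """Return the l1, l2 and l-infinity norms of beta_hat - beta."""
    diff = np.asarray(beta_hat, dtype=float) - np.asarray(beta, dtype=float)
    return {
        "l1": np.linalg.norm(diff, ord=1),
        "l2": np.linalg.norm(diff, ord=2),
        "linf": np.linalg.norm(diff, ord=np.inf),
    }

# Hypothetical example vectors.
beta_true = np.array([1.0, 0.0, -2.0])
beta_est = np.array([1.5, 0.0, -1.0])
diffs = coefficient_differences(beta_est, beta_true)
print(diffs)
```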
The number of iterations for Huber- and Tukey-loss interpolation does not seem to correspond to the generalization errors (Sec. [8](https://arxiv.org/html/2605.21494#S8)). For $Y$-contamination and centered predictors, the number of iterations for both algorithms drops quickly after a peak around $p=n$, but the test MSE behaves completely differently for Huber- and Tukey-loss interpolation. Moreover, although the test MSE decreases for Huber-loss interpolation for large $p$, the number of iterations remains at its maximum. For non-centered predictors, the number of iterations eventually decreases for very large $p$, which coincides with the point where the test MSE starts to decrease for $Y$-contamination and $r \in \{0.1, 0.25\}$, but it corresponds neither to the test MSE for $r=0.5$ nor to that for $X$-contamination.
### 9.2 Conclusion and outlook

In this work, we experimentally studied overparametrized regression on contaminated data. We compared the performance of the minimum $l_2$-norm interpolator, Huber-loss and Tukey-loss interpolation as well as SLTS and RRBoost. We also proposed an interpolation variant on clean subsets, where SLTS or RRBoost is first applied in order to identify a clean subset, on which the minimum $l_2$-norm interpolator is then computed. The contamination schemes include gross outliers and cover both $X$- and $Y$-contamination.
The results reveal a surprising robustness of the minimum $l_2$-norm interpolator, whose generalization performance surpasses that of any robust counterpart, regardless of the covariance structure of the predictors, whether the predictors are centered, the contamination radius or the contamination magnitude. In particular, provided that the SNR is sufficiently large, a double descent phenomenon can be observed, where the test MSE first decreases until it attains its minimum at $p=s$, increases until a peak at $p=n$, and decreases again for $p>n$. For small SNR, there is no minimum at $p=s$, but the test MSE decreases as well for $p>n$. For centered predictors and moderate contamination magnitudes, the Huber-loss interpolator leads to similar test MSE shapes; however, its test MSE starts to decrease later than that of the minimum $l_2$-norm interpolator.
Future work should assess whether the theoretical results on the double descent behavior of the minimum $l_2$-norm interpolator can be extended to contaminated data; to the best of our knowledge, such results have not yet been provided. In particular, proving a double descent for the minimum $l_2$-norm interpolator would be important for practical applications where the data can be assumed to be contaminated. For $p \gg n$, applying minimum $l_2$-norm interpolation instead of a "robust" counterpart would imply large computational advantages, since the latter, due to a non-convex objective, would necessitate an iterative optimization scheme.
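To illustrate the computational point, the following minimal sketch computes the minimum $l_2$-norm interpolator in closed form via the $n \times n$ Gram system, $\hat{\beta} = X^\top (X X^\top)^{-1} y$, which is valid when $X$ has full row rank. The sizes, the noise level and the simple $Y$-contamination (shifting 10% of the responses by 100) are assumptions made for illustration and do not reproduce the simulation design of this study.

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Minimum l2-norm interpolator via the n x n Gram system.

    Assumes X (n x p with p > n) has full row rank, so X X^T is invertible.
    """
    gram = X @ X.T                      # n x n, cheap when n << p
    return X.T @ np.linalg.solve(gram, y)

rng = np.random.default_rng(0)
n, p, s = 50, 2000, 5                   # hypothetical sizes
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 1.0                          # s non-zero true coefficients
y = X @ beta + 0.5 * rng.standard_normal(n)

# Hypothetical Y-contamination: shift 10% of the responses by 100.
outliers = rng.choice(n, size=n // 10, replace=False)
y[outliers] += 100.0

beta_hat = min_norm_interpolator(X, y)
beta_pinv = np.linalg.pinv(X) @ y       # same solution via the pseudoinverse

# The fit interpolates the (contaminated) training data, without iterations.
interpolates = np.allclose(X @ beta_hat, y, atol=1e-6)
matches_pinv = np.allclose(beta_hat, beta_pinv, atol=1e-8)
print(interpolates, matches_pinv)
```

Only one linear system of size $n$ has to be solved, whereas an iterative scheme for a robust loss would typically have to solve a weighted problem in every iteration.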
## 10 Acknowledgements
The simulations were conducted on the HPC cluster ROSA, located at the University of Oldenburg \(Germany\)\. ROSA was funded by the German Research Foundation \(DFG\) through its Major Research Instrumentation Programme \(INST 184/225\-1 FUGG\) and the Ministry of Science and Culture \(MWK\) of Lower Saxony\.
## References