On design-unbiased algorithmic Machine Learning
Summary
This paper investigates conditions for algorithmic machine learning (e.g., kNN, random forest) to achieve design-unbiased prediction and classification for finite populations, using probability sampling designs rather than assumed data models. It extends design-based inference from survey sampling to ML algorithms.
# On design-unbiased algorithmic Machine Learning
Source: [https://arxiv.org/html/2606.28795](https://arxiv.org/html/2606.28795)
Li-Chun Zhang (University of Southampton & Statistisk sentralbyrå), Siu-Ming Tam (Australian Bureau of Statistics), Luis Sanguiao-Sande (Instituto Nacional de Estadística), Wesley Yung (Statistics Canada, Ret.), Anders Holmberg (Australian Bureau of Statistics)

Correspondence: Li-Chun Zhang, Department of Social Statistics and Demography, Univ. of Southampton, Highfield SO17 1BJ, Southampton, UK. Email: L.Zhang@soton.ac.uk
###### Abstract
Machine Learning (ML) algorithms, such as k-Nearest Neighbours (kNN) or random forest, eschew the ideal of true data models in favour of predictive performance. However, minimising the MSE or F-score cannot lead to unbiasedness directly, which is important in many situations such as official statistics. We study the conditions of algorithmic ML, other than the existence and knowledge of true data models, which lead to unbiased prediction or classification for a given finite population, including how the training data may be sampled from the population, how a trained prediction algorithm can be tuned to achieve unbiased prediction or classification for that population, and how the performance of out-of-sample prediction or classification can be assessed unbiasedly. The inference is based on the *known* probability design of samples and training sets, rather than any *assumed* distributions or models.

Key words: $pq$-design, prediction estimator, debiasing, classification accuracy

## 1 Introduction
Breiman \(2001a\) contrasts algorithmic and data modelling cultures, where the ideal of true data models is eschewed in favour of predictive performance of ML algorithms, such as kNN or random forest\. However, minimising the MSE or F\-score cannot directly lead to unbiasedness of prediction or classification, which is important in many finite\-population applications of ML\.
For instance, many official statistics are defined as descriptive summaries of a country’s population, economy, society or environment\. While the quality of official statistics is multi\-dimensional \(e\.g\. United Nations, 2019; European Commission, 2017; Statistics Canada, 2017\), accuracy and reliability as measured in terms of bias and variance, respectively, are central in all the quality frameworks necessary to maintain the public trust\.
As a common approach to producing official statistics, survey sampling shares Breiman’s perspective on ML algorithms\. The focus is to improve the sampling strategy, consisting of a probability sampling design and an associated choice of estimator \(Neyman, 1934\)\. The inference is with respect to the known sampling design, regardless of the existence or knowledge of a true data model\. See Hansen \(1987\), Smith \(1994\), Kalton \(2002\), Rao \(2005, 2011\), Beaumont and Haziza \(2022\) for reviews and appraisals\.
The well\-known Horvitz\-Thompson estimator \(Horvitz and Thompson, 1952\) is the most typical example, which is exactly unbiased over repeated sampling from a given population\. To improve the efficiency by leveraging auxiliary information about the population, i\.e\. in addition to the sampling design, it is popular to adopt a design\-based model\-assisted approach \(e\.g\. Särndal et al\., 1992\), whereby a prediction model relating the target outcome to the known covariates can be introduced to adjust the Horvitz\-Thompson estimator, but the inference of bias and variance is still with respect to the sampling design, without the need for the assisting model to be the true data model\.
Breidt and Opsomer \(2017\) summarise a general “recipe” of model\-assisted estimation, which adjusts the model\-prediction of the population total by the weighted sum of sample prediction residuals, where the weights are inverse sample inclusion probabilities as in the Horvitz\-Thompson estimator\.
In case the assisting model is pre\-trained, i\.e\. not learned on the actual sample, the resulting difference estimator \(e\.g\. Särndal et al\., 1992, Sec\. 6\.3\) is exactly design\-unbiased\. See also Angelopoulos et al\. \(2023\) for a related approach to independent and identically distributed \(IID\) samples, either by stipulation or by with\-replacement sampling from finite populations\.
Otherwise, and more commonly in practice, the model is estimated from the actual sample, so that the resulting estimator is no longer exactly unbiased. To justify any given ML algorithm in this context, some authors resort to the notion of asymptotic consistency, as the population and sample sizes increase to infinity, under suitable regularity conditions. See McConville and Toth (2019) for regression trees generated by the recursive partitioning algorithm, or Dagdoug et al. (2023) for random forest by the original algorithm of Breiman (2001b).

However, as Smith (1994) points out, the “asymptotic notion of consistency” is not immediately applicable to the given population as “a real entity”. One may observe an alternative notion of consistency following the works of Fisher (1956) and Neyman (1934). For a given population and sampling method, if $\hat{\theta}$ is unbiased for the vector of population totals $\theta$, then $g(\hat{\theta})$ is called “consistent” for $g(\theta)$ by Fisher (1956); whereas an interval estimator of a population statistic is called “consistent” by Neyman (1934), if it achieves the designated level of coverage. Zhang et al. (2025) refer to such finite-sample design-unbiased estimators as *Neyman-Fisher consistent*.
Sanguiao\-Sande and Zhang \(2021\) propose an approach to Neyman\-Fisher consistent population total estimation, which is exactly design\-unbiased in the finite\-sample setting\. In particular, given any ML algorithm as the assisting model, one can apply Rao\-Blackwellisation \(Rao, 1945; Blackwell, 1947\) to the Horvitz\-Thompson estimator of the total prediction errors in the population outside the training set to achieve design\-unbiasedness\. The resulting estimator still uses the sampling design weights explicitly, like all other traditional design\-based model\-assisted estimators\.
Zhang et al. (2025) extend the *subsampling Rao-Blackwell (SRB)* technique of Sanguiao-Sande and Zhang (2021) to a larger class of estimators, called the *prediction estimators* (Royall, 1970; Valliant et al., 2000), where a prediction estimator of a population total is the sum of the observed sample total and a predicted out-of-sample total. Since one can plug in any ML algorithm for the latter, the prediction estimator can be constructed by a working model of the population without using the sampling weights at all, although doing so would often cause bias over repeated sampling from the given population. Applying Rao-Blackwellisation to the subsample-trained prediction estimator, Zhang et al. (2025) obtain exactly design-unbiased estimators of the bias and MSE of the resulting SRB-prediction estimators.
However, Zhang et al\. \(2025\) did not study the general conditions of ML that can lead to design\-unbiased prediction estimators\. Nor did they consider the estimation of classification accuracy when the prediction estimator is given by unit\-level classifications of the out\-of\-sample units\.
In this paper, we shall focus on the conditions of ML under finite population sampling, which would yield design\-unbiased estimation of population totals, either by predicting or classifying the out\-of\-sample units\. Since the inference is design\-based, the unbiasedness holds regardless of the ‘truthfulness’ of the assisting model or ML algorithm\. This is especially useful in applications of ML, where unbiasedness for finite populations is of critical importance\.
The specific questions we consider are: \(i\) how the training data may be sampled from the population, so as to enable the inference of out\-of\-sample prediction errors based on the observed in\-sample test errors, \(ii\) how an ML algorithm, which is formed by the training data, can be tuned by the test errors to yield design\-unbiased out\-of\-sample prediction, and \(iii\) how out\-of\-sample unit\-level classification can yield design\-unbiased prediction estimation, and how to assess the associated classification accuracy unbiasedly\.
In the rest of the paper, Sections [2](https://arxiv.org/html/2606.28795#S2) and [4](https://arxiv.org/html/2606.28795#S4) deal with unbiased prediction and classification, respectively. Sections [3](https://arxiv.org/html/2606.28795#S3) and [5](https://arxiv.org/html/2606.28795#S5) present illustrations, simulations and an application of the theory developed, using kNN as the example ML algorithm throughout. Section [6](https://arxiv.org/html/2606.28795#S6) gives concluding remarks.

## 2 Design-unbiased prediction

Denote by $U=\{1,\ldots,N\}$ a given finite population, with known feature vector $x_i$ for each $i\in U$. Denote by $s$ the sample of $n$ units with observed *outcome* $y_i$ for each $i\in s$, whereas $y_i$ is unknown for the rest of the population. We treat all the values $\{(y_i,x_i): i\in U\}$ as constants, whether they are known or not. The variation of ML and prediction is due to the following two elements.
First, let the sample $s$ be selected from $U$ by a probability sampling design, to be referred to as the *$p$-design* and denoted by

$$s\sim p(s)$$

where $p(s)$ sums to 1 over all possible $s$, and $\pi_i=\Pr(i\in s)>0$ for any $i\in U$. Next, for algorithmic ML, let $s_1$ be the *training* set taken from $s$, and let $s_2=s\setminus s_1$ be the corresponding *test* set, which are created according to a specific design, to be referred to as the *$q$-design* and denoted by

$$s_1\sim q(s_1\mid s).$$

For instance, $s_1$ of size $n_1$ may be selected from $s$ by simple random sampling without replacement (simply SRS from now on). Or, $s_1$ may be created by bootstrap of $s$, in which case a given unit in $s$ may be selected multiple times in $s_1$, and $s_2$ contains the units in $s$ not selected to $s_1$ at all. Or, by $L$-folding, one would create $L$ test sets $s_2$ by a random $L$-partition of $s$, and the $L$ training sets $s_1=s\setminus s_2$ are determined accordingly.
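The three $q$-designs just described can be sketched in a few lines of Python. This is a minimal illustration only: the function names `srs_split`, `bootstrap_split` and `l_fold_split` are hypothetical, and the bootstrap is assumed to take $n$ draws with replacement from $s$.

```python
import random

def srs_split(s, n1, rng):
    """q-design by SRS: draw n1 training units from s without replacement."""
    s1 = rng.sample(sorted(s), n1)
    s2 = sorted(set(s) - set(s1))          # test set: the rest of the sample
    return s1, s2

def bootstrap_split(s, rng):
    """q-design by bootstrap: n draws with replacement (assumed convention)."""
    units = sorted(s)
    s1 = [rng.choice(units) for _ in units]  # may contain repeated units
    s2 = sorted(set(units) - set(s1))        # units never drawn into s1
    return s1, s2

def l_fold_split(s, L, rng):
    """q-design by L-folding: random L-partition of s into test sets."""
    units = sorted(s)
    rng.shuffle(units)
    folds = [units[j::L] for j in range(L)]
    # Each pair is (training set, test set) with training = s minus the fold.
    return [(sorted(set(units) - set(f)), sorted(f)) for f in folds]

rng = random.Random(0)
sample = list(range(12))
print(srs_split(sample, 4, rng))
print(bootstrap_split(sample, rng))
print(l_fold_split(sample, 3, rng))
```

In each case the test set $s_2$ is whatever part of $s$ is left outside the training set, which is what the factorisation of the $pq$-design relies on.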
In any case, we consider the $pq$-design to be well-defined for inference, if the joint distribution of $(s_1,s)$ admits the factorisation

$$q(s_1\mid s)\,p(s)=f(s_1)\,f(s\mid s_1) \tag{1}$$

such that the non-empty test set $s_2$ can be regarded as a probability sample from $U\setminus s_1$ conditional on $s_1$. Notice that the trivial $q$-design, $\Pr(s_1=s\mid s)=1$, does not yield a well-defined $pq$-design since $s_2\equiv\emptyset$, although it is feasible for ML with default setups, such as kNN with pre-fixed $k$. Well-defined $pq$-designs, however, are necessary for the inference of ML.
As a typical target of interest, let us consider the population $y$-total, which can be decomposed into the observed sample total and the unknown total in the rest of the population, denoted by

$$Y=\sum_{i\in U}y_i=\sum_{i\in s}y_i+\sum_{i\notin s}y_i=\sum_{i\in s}y_i+Y_R$$

where the subscript denotes $R=U\setminus s$. Let a *prediction estimator* of $Y$ be

$$\hat{Y}=\sum_{i\in s}y_i+\sum_{i\notin s}\hat{y}_i=\sum_{i\in s}y_i+\hat{Y}_R$$

where $\hat{y}_i$ can be given by any ML prediction or classification algorithm.

In particular, as the out-of-sample total $Y_R$ varies with $s$ under repeated sampling $s\sim p(s)$, unbiased estimation of $Y$ (as a constant) is equivalent to unbiased prediction of $Y_R$ (as a random variable), denoted by

$$E_p(\hat{Y})=Y\quad\Leftrightarrow\quad E_p(\hat{Y}_R-Y_R)=0.$$

### 2.1 Representative training

Denote by $\mu(x,s_1)$ a predictor obtained from the training data $\{(y_j,x_j): j\in s_1\}$, which is aimed at the outcome $y$-value given the associated feature vector $x$. The notation signifies that, for any unit outside the training set, $i\notin s_1$ with $x_i=x$, its predicted $y$-value $\mu(x,s_1)$ varies only with $x$ and $s_1$.
#### Definition
A well-defined $pq$-design is said to yield *representative training* of $\mu(x,s_1)$ if, $\forall i\in U$, we have

$$E_{pq}\big(\mu(x_i,s_1)\mid i\in s_2\big)=E_{pq}\big(\mu(x_i,s_1)\mid i\notin s\big). \tag{2}$$

In other words, representative training is the case, by the given $pq$-design, if the expected predictor $\mu(x_i,s_1)$ is the same for any unit $i\in U$ conditional on its being outside the training set, regardless of whether the unit needs to be predicted (i.e. $i\notin s$) or is observed and may be used for inference (i.e. $i\in s_2$). For simplicity, we shall say a unit $i$ is *out-of-bag (OOB)* if it is in the sample but outside the training set, i.e. $i\in s_2$, whereas it is out-of-sample if $i\notin s$.

The concept of representative training ([2](https://arxiv.org/html/2606.28795#S2.E2)) is intuitive, since it allows one to connect the unobserved out-of-sample performance of $\mu(x,s_1)$ to its observed in-sample OOB performance. For instance, conditional on $s_1$, all the prediction errors $\mu(x_i,s_1)-y_i$ can be partitioned according to $U\setminus s_1=s_2\cup R$, where $s_2$ is a sample from $U\setminus s_1$ with respect to the $pq$-design, which would allow us to make inference of the prediction bias of $\hat{Y}_R=\sum_{i\in R}\mu(x_i,s_1)$ later.
###### Lemma 1\.
A well-defined $pq$-design yields representative training,

- for all possible ML algorithms $\mu(x,s_1)$ if and only if, $\forall i\in U$, we have

  $$\pi_{2i}:=\Pr(i\in s_2\mid s_1)=\frac{\pi_i-\Pr(i\in s_1)}{1-\Pr(i\in s_1)}\,\mathbb{I}(i\notin s_1) \tag{3}$$

- for any given ML algorithm $\mu(x,s_1)$ if and only if, $\forall i\in U$, we have

  $$\text{Cov}_{s_1}\big(\mu(x_i,s_1),\pi_{2i}\mid i\notin s_1\big)=0. \tag{4}$$

The proof is given in Appendix [A](https://arxiv.org/html/2606.28795#A1), as are the proofs of other results in the paper. Since condition ([3](https://arxiv.org/html/2606.28795#S2.E3)) implies ([4](https://arxiv.org/html/2606.28795#S2.E4)), the latter may be regarded as a more general condition for representative training: it means that *any* well-defined $pq$-design may yield representative training of certain models, i.e. those satisfying ([4](https://arxiv.org/html/2606.28795#S2.E4)). However, one may be unable to verify the condition ([4](https://arxiv.org/html/2606.28795#S2.E4)), which requires all the relevant unconditional and conditional sampling probabilities to be known. The condition ([3](https://arxiv.org/html/2606.28795#S2.E3)) is thus more readily applicable, given which the $pq$-design guarantees representative training of *all* possible ML algorithms.
###### Corollary 1\.
Given $s$ by SRS from $U$ and $s_1$ by SRS from $s$, the SRS-SRS $pq$-design satisfies ([3](https://arxiv.org/html/2606.28795#S2.E3)).
###### Example 1\.
Let $U=\{i_1,i_2,i_3,i_4\}$, and $(n,n_1)=(2,1)$ for the SRS-SRS $pq$-design. We have $\pi_{2i}=\frac{n-n_1}{N-n_1}=\frac{1}{3}$, $\forall i\in U$, which satisfies ([3](https://arxiv.org/html/2606.28795#S2.E3)) since $\pi_i=\frac{2}{4}$ and $\Pr(i\in s_1)=\frac{1}{4}$.

- Given $i_1\in s_2$, we have $s=\{i_1,i_2\},\{i_1,i_3\},\{i_1,i_4\}$, yielding 3 distinct training sets $s_1=\{i_2\},\{i_3\},\{i_4\}$ without $i_1$, each with probability $f(s_1\mid i_1\in s_2)=\frac{1}{3}$.
- Given $i_1\notin s$, we have $s=\{i_2,i_3\},\{i_2,i_4\},\{i_3,i_4\}$, yielding 3 distinct training sets $s_1=\{i_2\},\{i_3\},\{i_4\}$, each with probability $f(s_1\mid i_1\notin s)=\frac{1}{3}$. Notice that the same training set can occur in different $s$ without $i_1$, such as $s_1=\{i_2\}$ in $s=\{i_2,i_3\}$ or $s=\{i_2,i_4\}$, but the probability of $s_1=\{i_2\}$ is still $\frac{2}{6}=\frac{1}{3}$.

It is clear that, given any $\mu$ trainable on $s_1$, the expectation of $\mu(x_i,s_1)$ is the same, either conditional on $i\in s_2$ or $i\notin s$, for any $i\in U$, i.e. representative training.
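The probabilities in Example 1 can be checked by brute-force enumeration. The sketch below lists every sample $s$ and every training set $s_1$ under the SRS-SRS $pq$-design and tabulates the exact distribution of $s_1$ given $i_1\in s_2$ and given $i_1\notin s$; the function name `training_set_distributions` is a hypothetical helper, not something from the paper.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

def training_set_distributions(U, n, n1, unit):
    """Exact distribution of s1 given `unit` in s2, and given `unit` not in s,
    under SRS of s from U followed by SRS of s1 from s."""
    samples = list(combinations(U, n))
    oob = defaultdict(Fraction)  # joint probability of (s1, unit in s2)
    oos = defaultdict(Fraction)  # joint probability of (s1, unit not in s)
    for s in samples:
        subsamples = list(combinations(s, n1))
        prob = Fraction(1, len(samples) * len(subsamples))  # p(s) * q(s1 | s)
        for s1 in subsamples:
            if unit not in s:
                oos[s1] += prob
            elif unit not in s1:
                oob[s1] += prob

    def normalise(joint):
        total = sum(joint.values())
        return {s1: pr / total for s1, pr in joint.items()}

    return normalise(oob), normalise(oos)

U = ("i1", "i2", "i3", "i4")
given_oob, given_oos = training_set_distributions(U, n=2, n1=1, unit="i1")
print(given_oob)  # distribution of s1 given i1 in s2
print(given_oos)  # distribution of s1 given i1 not in s
```

Because the two conditional distributions of $s_1$ coincide, any predictor that depends on the data only through $(x, s_1)$ has the same conditional expectation in both cases, which is the representative training property (2).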
Apart from SRS, it is common to generate $(s_1,s_2)$ by $L$-folding or bootstrap sampling. For $L$-folding, suppose (i) $n_2=n/L$ is an integer given the sample size $n$ of $s$, and (ii) the $L$ sets of $s_2$ are drawn sequentially from $s$ by SRS. We would then have the same $f(s_1)$ by SRS-$L$-folding as by the SRS-SRS $pq$-design, and representative training of all possible $\mu(x,s_1)$; whereas, without (i) and (ii), representative training can be the case approximately.
Next, when $(s_1,s_2)$ are given by bootstrap sampling of $s$, the probability $f(s_1)$ would depend on the number of distinct units in $s_1$, denoted by $m$, as well as their realised frequencies in $s_1$. The expected frequency is the same for each distinct unit, and it is well known that $m/n$ tends to $1-e^{-1}$ as $n$ increases. By the symmetry of this SRS-bootstrap $pq$-design, $\pi_{2i}=\Pr(i\in s_2\mid s_1)$ is approximately a constant for $i\notin s_1$, given sufficiently large $n$, and we have approximately representative training for all possible $\mu(x,s_1)$.
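The limit quoted for $m/n$ follows from a one-line calculation, sketched here under the assumption that the bootstrap training set consists of $n$ independent draws with replacement from $s$ (the usual convention, which the text does not spell out). The calculation is stated for the expected fraction $E(m)/n$:

```latex
\Pr(i \notin s_1 \mid i \in s) = \Big(1 - \frac{1}{n}\Big)^{n} \longrightarrow e^{-1},
\qquad
\frac{E(m)}{n} = 1 - \Big(1 - \frac{1}{n}\Big)^{n} \longrightarrow 1 - e^{-1} \approx 0.632 .
```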
###### Corollary 2\.
Let $s$ be given by Poisson sampling from $U$, and $s_1$ by Bernoulli sampling from $s$. The Poisson-Bernoulli $pq$-design satisfies ([3](https://arxiv.org/html/2606.28795#S2.E3)).

Corollary [2](https://arxiv.org/html/2606.28795#Thmcoro2) suggests that, for an unequal-probability $p$-design with fixed sample size $n$, a negligible sampling fraction $n/N$ can yield approximately representative training as long as one adopts an equal-probability $q$-design.
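A short sketch of why the Poisson-Bernoulli design meets condition (3), writing $r$ for the constant Bernoulli selection probability of the $q$-design (the symbol $r$ is introduced here for illustration only; the paper's own proof is in Appendix A). Both designs treat units independently, so conditioning on $s_1$ affects unit $i$ only through the event $i\notin s_1$:

```latex
\Pr(i \in s_1) = r\,\pi_i ,
\qquad
\Pr(i \in s_2 \mid s_1)
  = \frac{\Pr(i \in s,\ i \notin s_1)}{\Pr(i \notin s_1)}
  = \frac{\pi_i (1 - r)}{1 - r\,\pi_i}
  = \frac{\pi_i - \Pr(i \in s_1)}{1 - \Pr(i \in s_1)}
\quad \text{for } i \notin s_1 .
```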
### 2.2 Prediction unbiasedness and tuning

The set of out-of-sample units $R=U\setminus s$ varies with $s\sim p(s)$. The predictor $\mu(x,s_1)$ obtained from $s_1$ is $pq$-unbiased for out-of-sample prediction, if

$$E_{pq}\Big(\sum_{i\in R}\mu(x_i,s_1)\Big)=E_p\Big(\sum_{i\in R}y_i\Big) \tag{5}$$

over all the out-of-sample units. We leave it to future studies to investigate if design-unbiased prediction of unit-specific $y_i$ can be achieved sensibly.

Below we provide a sufficient condition for ([5](https://arxiv.org/html/2606.28795#S2.E5)) given representative training by $pq$-design. Consequently, we show how it can be used to *tune* the given $\mu(x,s_1)$ to achieve unbiased prediction.
###### Proposition 1\.
Given representative training of $\mu(x,s_1)$ by the $pq$-design, it is $pq$-unbiased for out-of-sample prediction ([5](https://arxiv.org/html/2606.28795#S2.E5)) if, for any $s\sim p(s)$, we have

$$\sum_{i\in s}\big(\tfrac{1}{\pi_i}-1\big)E_q\big(\mu(x_i,s_1)\mid i\in s_2\big)=\sum_{i\in s}\big(\tfrac{1}{\pi_i}-1\big)y_i. \tag{6}$$
###### Corollary 3\.
Under the SRS-SRS $pq$-design, the predictor $\mu(x,s_1)$ is $pq$-unbiased for out-of-sample prediction if, for any $s\sim p(s)$, we have

$$\sum_{i\in s}E_q\big(\mu(x_i,s_1)\mid i\in s_2\big)=\sum_{i\in s}y_i. \tag{7}$$

The *$q$-benchmarking* condition ([6](https://arxiv.org/html/2606.28795#S2.E6)) is stated for the given sample $s$, without any unobserved $y$-values. It is likely that the equality will not hold exactly in practice, unless the predictor is rather simplistic, such as the training-set mean given an SRS-SRS $pq$-design. Although this does not imply that the predictor $\mu(x,s_1)$ must be biased, since the condition is sufficient but not necessary, a non-negligible discrepancy between the two sides of ([6](https://arxiv.org/html/2606.28795#S2.E6)) may be a reason for tuning to achieve out-of-sample prediction unbiasedness.

Meanwhile, let the *subsampling Rao-Blackwell (SRB) predictor* of any out-of-sample unit with $x_i=x$ be given as

$$\bar{\mu}(x,s)=E_q\big(\mu(x,s_1)\mid s\big). \tag{8}$$

By definition, $\bar{\mu}(x,s)$ is more efficient than $\mu(x,s_1)$ because it removes the extra training variance $V_q\big(\mu(x,s_1)\mid s\big)$, and it has the same out-of-sample prediction bias as $\mu(x,s_1)$, which can be removed as follows.
###### Proposition 2\.
Given a representative training $pq$-design, let $\pi_i=\Pr(i\in s)$ and $q_{2i}:=\Pr(i\in s_2\mid i\in s)$, and let $n_R$ be the number of out-of-sample units. The predictor $\tilde{\mu}(x,s)$ is $pq$-unbiased for out-of-sample prediction, where

$$\tilde{\mu}(x,s)=\bar{\mu}(x,s)-\tau_{\mu}(s)$$

and

$$\tau_{\mu}(s)=\tfrac{1}{n_R}\Big\{E_q\Big(\sum_{i\in s_2}\big(\tfrac{1}{\pi_i}-1\big)\tfrac{1}{q_{2i}}\mu(x_i,s_1)\Big)-\sum_{i\in s}\big(\tfrac{1}{\pi_i}-1\big)y_i\Big\}. \tag{9}$$
###### Corollary 4\.
Given the SRS-SRS $pq$-design, any given SRB-predictor ([8](https://arxiv.org/html/2606.28795#S2.E8)) can be tuned by ([9](https://arxiv.org/html/2606.28795#S2.E9)) to be unbiased for out-of-sample prediction ([5](https://arxiv.org/html/2606.28795#S2.E5)), where $\tau_{\mu}(s)$ is the average in-sample OOB prediction error, which is given as

$$\tau_{\mu}(s)=E_q\Big(\tfrac{1}{n_2}\sum_{i\in s_2}\big(\mu(x_i,s_1)-y_i\big)\Big)=\tfrac{1}{n}\sum_{i\in s}E_q\big(\mu(x_i,s_1)\mid i\in s_2\big)-\tfrac{1}{n}\sum_{i\in s}y_i. \tag{10}$$
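As a worked step, the general tuning term (9) reduces to the first form of (10) once the SRS-SRS inclusion probabilities are substituted. The sketch below assumes fixed sizes $n$, $n_1$ and $n_2=n-n_1$, so that $\pi_i=n/N$, $q_{2i}=n_2/n$ and $n_R=N-n$, and uses $E_q\big(\tfrac{1}{n_2}\sum_{i\in s_2}y_i\big)=\tfrac{1}{n}\sum_{i\in s}y_i$:

```latex
\begin{aligned}
\tfrac{1}{\pi_i} - 1 &= \tfrac{N-n}{n}, \qquad \tfrac{1}{q_{2i}} = \tfrac{n}{n_2}, \qquad n_R = N - n, \\
\tau_{\mu}(s) &= \tfrac{1}{N-n}\Big\{ \tfrac{N-n}{n}\cdot\tfrac{n}{n_2}\,
   E_q\Big(\sum_{i\in s_2}\mu(x_i,s_1)\Big) - \tfrac{N-n}{n}\sum_{i\in s} y_i \Big\} \\
 &= E_q\Big(\tfrac{1}{n_2}\sum_{i\in s_2}\mu(x_i,s_1)\Big) - \tfrac{1}{n}\sum_{i\in s} y_i
  = E_q\Big(\tfrac{1}{n_2}\sum_{i\in s_2}\big(\mu(x_i,s_1) - y_i\big)\Big).
\end{aligned}
```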
## 3 k-Nearest Neighbor (kNN) prediction

Let us illustrate with kNN as a simple ML algorithm. Given feature vector $x$ and donor set $s$ of size $n$, let $\mathcal{N}_k(x;s)$ denote the set of $k$ nearest neighbours (or units in $s$) in terms of $x$, such that the kNN predictor can be given as

$$\eta(x,s)=\frac{1}{k}\sum_{j\in\mathcal{N}_k(x;s)}y_j.$$
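A minimal NumPy sketch of this predictor follows. The function name `knn_predict`, the use of Euclidean distance and the tie-breaking by donor order are assumptions made for illustration; the text does not fix a distance or a tie rule.

```python
import numpy as np

def knn_predict(x_new, x_donor, y_donor, k):
    """kNN predictor eta(x, s): mean outcome of the k nearest donors.

    x_new has shape (m, p), x_donor has shape (n, p), y_donor has shape (n,).
    """
    x_new = np.asarray(x_new, dtype=float)
    x_donor = np.asarray(x_donor, dtype=float)
    y_donor = np.asarray(y_donor, dtype=float)
    # Squared Euclidean distances between every new point and every donor.
    d2 = ((x_new[:, None, :] - x_donor[None, :, :]) ** 2).sum(axis=2)
    # Stable sort, so ties are broken deterministically by donor order.
    nearest = np.argsort(d2, axis=1, kind="stable")[:, :k]
    return y_donor[nearest].mean(axis=1)

x_donor = np.array([[0.0], [1.0], [2.0], [10.0]])
y_donor = np.array([1.0, 2.0, 3.0, 10.0])
print(knn_predict(np.array([[0.9]]), x_donor, y_donor, k=2))  # averages the donors at x = 1 and x = 0
```

A deterministic tie rule keeps the prediction a function of $(x, s_1)$ only, which is what the notation $\mu(x,s_1)$ in Section 2.1 requires.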
According to Stone (1977, Theorem 2 and Corollary 3), the kNN predictor is IID-consistent for $E_{XY}(y\mid x)$ for any fixed $x$-value, as both the sample size $n$ and $k$ tend to $\infty$, where $E_{XY}$ denotes model expectation over the distribution of $(x,y)$. While the IID assumption may not be unreasonable in many applications, it seems unrealistic to require $k\rightarrow\infty$ in finite-sample settings where kNN is typically applied with a small fixed $k$.

Let us apply design-based prediction theory under the SRS-SRS $pq$-design, where explicit results are available, which can as well be instructive for those interested in IID-model inference. Let the population mean and its generic prediction estimator be given as, respectively,

$$P=\frac{1}{N}\sum_{i\in U}y_i\qquad\text{and}\qquad\hat{P}=\frac{1}{N}\Big(\sum_{i\in s}y_i+\sum_{i\notin s}\hat{\mu}_i\Big)\quad\text{with}\quad\hat{\mu}_i=\mu(x_i,s).$$

### 3.1 Debiasing kNN
Let us consider the following baseline, or kNN\-based, predictors\.
- Sample-mean (baseline): $\hat{\mu}_i=\bar{y}_s=\tfrac{1}{n}\sum_{i\in s}y_i$
- Sample-kNN: $\hat{\mu}_i=\eta(x_i,s)$
- SRB-kNN: $\hat{\mu}_i=\bar{\eta}(x_i,s)=E_q\big(\eta(x_i,s_1)\mid s\big)$
- SRB-kNN, residual-tuned: $\hat{\mu}_i=\bar{\eta}(x_i,s)-\Big\{\tfrac{1}{n}\sum_{k\in s}\bar{\eta}(x_k,s)-\bar{y}_s\Big\}$
- SRB-kNN, OOB-tuned: $\hat{\mu}_i=\bar{\eta}(x_i,s)-\Big\{\tfrac{1}{n}\sum_{k\in s}\dot{\eta}(x_k,s)-\bar{y}_s\Big\}$

where $\dot{\eta}(x_k,s)$ is the OOB SRB-predictor for any $k\in s$, i.e.

$$\dot{\eta}(x_k,s)=E_q\big(\eta(x_k,s_1)\mid k\in s_2\big).$$

The sample-mean yields $pq$-unbiased prediction of $\bar{Y}_R=\frac{1}{N-n}\sum_{i\notin s}y_i$, but it does not use the $x$-feature information. Both sample-kNN and SRB-kNN are likely to be $pq$-biased: while we are not aware of any exactly unbiased estimator of the bias of sample-kNN, an unbiased bias estimator for SRB-kNN is given by the term in $\{\cdot\}$ of the OOB-tuned SRB-kNN according to the theory above. The residual-tuned SRB-kNN is not unbiased, although it may have a smaller bias than SRB-kNN, whereas the OOB-tuned SRB-kNN is exactly $pq$-unbiased. Let us illustrate with a simple example, where exact calculation is possible.
###### Example 2\.
In a population $U$ of size $N$, let $x_i=1$ if $i=1,2$ and $x_i=0$ if $i\neq 1,2$. Let $y_1=1$ and $y_i=0$ if $i\neq 1$. Suppose an SRS-SRS $pq$-design with $k=1$ for the kNN predictor. By straightforward though tedious calculation (Appendix [B](https://arxiv.org/html/2606.28795#A2)), we can show that the OOB-tuned SRB-kNN is exactly unbiased, whereas the prediction biases of $\bar{Y}_R$ by the different kNN-predictors are given as

$$\begin{aligned}
\text{Bias(sample-kNN)}&=\frac{1}{N-1}\Big(\frac{n}{N-1}-1\Big),\\
\text{Bias(SRB-kNN)}&=\frac{1}{N-1}\Big(\frac{n_1}{N-1}-1\Big),\\
\text{Bias(residual-tuned SRB-kNN)}&\overset{n_1=n_2}{=}\frac{1}{N-1}\Big(\frac{5n}{4(N-1)}-1\Big).
\end{aligned}$$

Notice that the bias of residual-tuned SRB-kNN depends on the choice of $(n_1,n_2)$, and the result given above refers to the case of $n_1=n_2$.
### 3.2 A simulation study
We can use simulations to illustrate the potential bias of using kNN predictors and design\-based debiasing\. Let a population of values be generated as below:
$$y_i=1+0.5x_{1i}+0.3x_{2i}+6\max(0,x_{1i})^2+\epsilon_i,$$

$$x_{1i},x_{2i}\overset{\text{IID}}{\sim}N(0,1),\quad\text{Cov}(x_{1i},x_{2i})=0.5\quad\text{and}\quad\epsilon_i\overset{\text{IID}}{\sim}N(0,0.2).$$

Figure [1](https://arxiv.org/html/2606.28795#S3.F1) depicts such a simulated population of size $N=250$. As can be seen, there is a dramatic ‘bend’ of $E(y_i\mid x_{1i})$ somewhere between $x_{1i}=0$ and $x_{1i}=1$, which can potentially cause bias in kNN-prediction since the realised $y$-values are rather imbalanced on either side of $x_{1i}$ as $x_{1i}$ increases.
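A population of this form could be generated along the following lines. This is a sketch under two stated assumptions: the pair $(x_{1i},x_{2i})$ is taken to be bivariate normal with unit variances and covariance 0.5, and the second argument of $N(0,0.2)$ is read as a variance. The function name `simulate_population` and the seed are arbitrary.

```python
import numpy as np

def simulate_population(N, rng, noise_var=0.2):
    """Generate (x, y) for N units from the data-generating equations above."""
    # Assumption: (x1, x2) jointly normal with unit variances, covariance 0.5.
    cov = np.array([[1.0, 0.5], [0.5, 1.0]])
    x = rng.multivariate_normal(mean=np.zeros(2), cov=cov, size=N)
    # Assumption: 0.2 is the error variance, so the scale is its square root.
    eps = rng.normal(loc=0.0, scale=np.sqrt(noise_var), size=N)
    y = 1 + 0.5 * x[:, 0] + 0.3 * x[:, 1] + 6 * np.maximum(0.0, x[:, 0]) ** 2 + eps
    return x, y

rng = np.random.default_rng(2024)
x, y = simulate_population(250, rng)
print(x.shape, y.shape, y.mean())
```

If 0.2 is instead meant as a standard deviation, passing `noise_var=0.04` gives the corresponding population.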
Figure 1: Scatter plots of simulated population, $N=250$.

We now conduct a Monte Carlo simulation with $B$ samples $s$ drawn by SRS from a simulated population, with population size $N=10^4$ and sample size $n=10^3$. Generically, let $\hat{\theta}^{(1)},\ldots,\hat{\theta}^{(B)}$ be the realised values of a given population mean prediction estimator over the $B$ simulated samples. Given the population mean $P$, let the *empirical* bias and MSE of $\hat{\theta}$ by simulation be

$$\text{EBias}(\hat{\theta})=\frac{1}{B}\sum_{b=1}^{B}\big(\hat{\theta}^{(b)}-P\big)\qquad\text{and}\qquad\text{EMSE}(\hat{\theta})=\frac{1}{B}\sum_{b=1}^{B}\big(\hat{\theta}^{(b)}-P\big)^2.$$

In addition, let the *empirical* standard error of $\hat{\theta}$ by simulation be

$$\text{ESE}(\hat{\theta})=\sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\Big(\hat{\theta}^{(b)}-\frac{1}{B}\sum_{b'=1}^{B}\hat{\theta}^{(b')}\Big)^2}.$$
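These three summaries translate directly into code. The helper below is a small sketch (the name `empirical_summary` is hypothetical) that takes the $B$ realised estimates and the true population mean $P$:

```python
import numpy as np

def empirical_summary(estimates, P):
    """Return (EBias, EMSE, ESE) of the simulated estimates around the true mean P."""
    est = np.asarray(estimates, dtype=float)
    ebias = np.mean(est - P)            # average signed error
    emse = np.mean((est - P) ** 2)      # average squared error
    ese = np.std(est, ddof=1)           # standard deviation with divisor B - 1
    return ebias, emse, ese

ebias, emse, ese = empirical_summary([1.0, 2.0, 3.0], P=1.5)
print(ebias, emse, ese)
```

With these divisors the three quantities satisfy $\text{EMSE}=\text{EBias}^2+\tfrac{B-1}{B}\,\text{ESE}^2$, which is a useful sanity check on simulation output.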
In case $\hat{\theta}$ refers to an SRB-kNN prediction estimator, we can estimate its bias and MSE as described in Appendix [C](https://arxiv.org/html/2606.28795#A3). Let $\overline{\text{bias}}(\hat{\theta})$ be the average of the $B$ bias estimates and $\overline{\text{mse}}(\hat{\theta})$ that of the MSE estimates. Next, $\overline{\text{mse}}(\hat{\theta})$ is available when $\hat{\theta}$ is the unbiased sample-mean $\bar{y}_s$. Finally, $\overline{\text{bias}}$ and $\overline{\text{mse}}$ of the sample-5NN are omitted in the results below, since we are unaware of any exactly unbiased estimator of the bias or MSE of sample-kNN.
Table 1: Results by SRS-SRS $pq$-design, $B=100$.

Table [1](https://arxiv.org/html/2606.28795#S3.T1) shows the simulation results, based on $B=100$ samples $s$, where kNN-prediction uses $k=5$. For the SRB-kNN predictors, we used $T=100$ for the Monte Carlo SRB and $n_1=50$ as the training set size; see Appendix [D](https://arxiv.org/html/2606.28795#A4) for details. Notice that the subsample donor set size $n_1=50$ is rather small compared to the sample size $n=1000$, which is chosen here purely for illustration purposes, because the bias of kNN-prediction tends to increase in magnitude as the donor set size decreases, i.e. $n_1$ for subsample-kNN or $n$ for sample-kNN.
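A Monte Carlo version of the SRB-kNN predictor and its OOB tuning might look as follows. This is only a sketch of the general idea: the details in Appendix D are not reproduced here, so the function names, the Euclidean distance and the use of the first form of (10) for the tuning term are all assumptions. For each of $T$ replicates, a training set $s_1$ of size $n_1$ is drawn by SRS from $s$, kNN is trained on it, and its predictions are averaged for the out-of-sample units, while its mean error on the OOB units $s_2$ is averaged to approximate $\tau_{\mu}(s)$.

```python
import numpy as np

def knn(x_new, x_don, y_don, k):
    """Mean outcome of the k nearest donors (Euclidean, ties by donor order)."""
    d2 = ((x_new[:, None, :] - x_don[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1, kind="stable")[:, :k]
    return y_don[nearest].mean(axis=1)

def srb_knn_oob_tuned(x_s, y_s, x_R, n1, k, T, rng):
    """Monte Carlo SRB-kNN predictions for out-of-sample units, plus OOB tuning.

    Returns (srb, tuned, tau): untuned SRB predictions, OOB-tuned predictions,
    and the Monte Carlo estimate of the average OOB prediction error.
    """
    n = len(y_s)
    pred_sum = np.zeros(len(x_R))
    tau_sum = 0.0
    for _ in range(T):
        perm = rng.permutation(n)
        s1, s2 = perm[:n1], perm[n1:]        # SRS training set and its OOB test set
        pred_sum += knn(x_R, x_s[s1], y_s[s1], k)
        # Mean OOB prediction error for this training set, as in (10).
        tau_sum += np.mean(knn(x_s[s2], x_s[s1], y_s[s1], k) - y_s[s2])
    srb = pred_sum / T
    tau = tau_sum / T
    return srb, srb - tau, tau

rng = np.random.default_rng(1)
x_s, x_R = rng.normal(size=(200, 2)), rng.normal(size=(50, 2))
y_s = 1 + 0.5 * x_s[:, 0] + rng.normal(scale=0.3, size=200)
srb, tuned, tau = srb_knn_oob_tuned(x_s, y_s, x_R, n1=50, k=5, T=100, rng=rng)
print(tau, srb.mean(), tuned.mean())
```

The population mean estimate then follows the generic form of $\hat{P}$, i.e. the sum of `y_s` plus the sum of the tuned predictions, divided by $N$. With finite $T$ the result is only a Monte Carlo approximation of the exact SRB quantities.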
We note the following about these results. Firstly, the population mean is $P=4.107$ in this case. The bias of the sample-5NN or SRB-5NN prediction estimator is not negligible: EBias is seen to dominate the corresponding ESE for both, namely $-0.225$ versus $0.028$ for the sample-5NN and $-0.795$ versus $0.300$ for SRB-5NN.
Secondly, the residual-tuned SRB-kNN is not unbiased, while the OOB-tuned SRB-kNN and the sample mean are both unbiased. However, comparing the three EBiases by simulation, i.e. $-0.099$ versus $-0.088$ or $-0.069$ in Table [1](https://arxiv.org/html/2606.28795#S3.T1), residual-tuning does seem to have removed most of the bias of kNN in this case. Nevertheless, debiasing by the OOB-tuned SRB-kNN is seen to be beneficial here, since it achieves unbiasedness as well as the lowest EMSE (i.e. $0.023$) among all the alternatives.
Thirdly, when it comes to bias and MSE estimation, the result is acceptable for the sample mean $\bar{y}_{s}$, where $\overline{\text{mse}}=0.050$ compared to $\text{EMSE}=0.049$. For the SRB-5NN, we have $\overline{\text{bias}}=-0.696$ compared to $\text{EBias}=-0.795$ and $\overline{\text{mse}}=0.726$ compared to $\text{EMSE}=0.721$. For the OOB-tuned SRB-5NN that is unbiased, $\overline{\text{mse}}=0.016$ is reasonable compared to its $\text{ESE}^{2}=0.015$. For the residual-tuned SRB-5NN, we have $\overline{\text{mse}}=0.016$, which is clearly lower than its $\text{EMSE}=0.024$ but close to $\overline{\text{bias}}^{2}+\text{ESE}^{2}=(-0.011)^{2}+0.118^{2}=0.014$. This again suggests that its EBias here likely overstates its bias due to the Monte Carlo error.
In summary, the simulation study here has demonstrated the potential bias caused by direct kNN\-prediction, such that debiasing by our theory may be beneficial\. Moreover, the associated inference of SRB\-prediction enables one to choose among the alternative predictors in practice\. Although residual\-tuning may be able to remove much of the bias, unbiased OOB\-tuning is preferable when its MSE is not larger than that of residual\-tuning\.
## 4Design\-based classification
We have so far considered design\-unbiased prediction\. Unit\-level classification of categorical outcomes may be of practical interest as well\. Below we consider specifically classification of binary outcomes\.
### 4\.1Out\-of\-sample classification accuracy
Let $\mu(x,s_{1})$ be the probability of $y=1$ according to $\mu(\cdot)$ trained on $s_{1}$, such as the kNN-mean given $x$ and donor set $s_{1}$. Let $\hat{y}_{i}(s_{1})$ be a binary classifier based on $\mu(x_{i},s_{1})$ for any $i\in U$. For instance, $\hat{y}_{i}(s_{1})=\mathbb{I}\big(\mu(x_{i},s_{1})>0.5\big)$ deterministically, or $\hat{y}_{i}(s_{1})=\mathbb{I}\big(u_{i}\leq\mu(x_{i},s_{1})\big)$ randomly given $u_{i}\overset{\text{IID}}{\sim}\text{Unif}(0,1)$, where $u_{i}$ is independently generated across the units for the purpose of classification.
Let the *out-of-sample classification accuracy* of $\hat{y}_{i}(s_{1})$ refer to its performance averaged over all the out-of-sample units $R=U\setminus s$, which is defined as

$$\phi(R\mid s_{1})=\frac{1}{n_{R}}\Phi(R\mid s_{1})\quad\text{and}\quad\Phi(R\mid s_{1})=\sum_{i\in R}\mathbb{I}\big(y_{i}=\hat{y}_{i}(s_{1})\big)\tag{11}$$

As with unbiased prediction ([5](https://arxiv.org/html/2606.28795#S2.E5)), we leave it to future study to determine whether design-unbiased classification or inference of unit-specific $y_{i}$ can be achieved in a sensible manner.
It follows that deterministic classification causes biased results generally. For instance, suppose the population proportion $P_{x}=0.2$ given $x$ is known, then $\mathbb{I}(P_{x}\geq\psi)$ is either 0 or 1 for *all* the units with $x_{i}=x$, given any threshold value $\psi\in(0,1)$, which is clearly biased. Meanwhile, let $N_{x}$ be the number of units with $x_{i}=x$ in the population, then the randomised classifier $\mathbb{I}(u_{i}\leq P_{x})$ achieves unbiased classification, in the sense that

$$E_{u}\Big(\tfrac{1}{N_{x}}\sum_{i\in U:x_{i}=x}\mathbb{I}\big(y_{i}=\mathbb{I}(u_{i}\leq P_{x})\big)\Big)=P_{x}=\tfrac{1}{N_{x}}\sum_{i\in U:x_{i}=x}y_{i}$$

where the expectation is with respect to $u_{i}\overset{\text{IID}}{\sim}\text{Unif}(0,1)$.
Given $\mu(x,s_{1})$ instead of $P_{x}$ in practice, one can estimate $\phi(R\mid s_{1})$ conditional on $s_{1}$ based on the OOB classification errors $\{\mathbb{I}\big(y_{i}=\hat{y}_{i}(s_{1})\big):i\in s_{2}\}$, just like estimating its prediction error based on $\{\mu(x_{i},s_{1})-y_{i}:i\in s_{2}\}$. However, what is of interest is the classification accuracy associated with the corresponding SRB-predictor $\bar{\mu}(x_{i},s)$, which is more efficient than $\mu(x_{i},s_{1})$.
Let us consider directly the Monte Carlo implementation of SRB based on $T$ random splits, $(s_{1}^{(t)},s_{2}^{(t)})$ for $t=1,...,T$. This includes the exact SRB-probability $\bar{\mu}(x,s)$ as the special case of $T=\mathcal{C}(n,n_{1})$ distinct splits. For any $i\in U$, let

$$T_{i}=\sum_{t=1}^{T}\mathbb{I}(i\notin s_{1}^{(t)})$$

be the number of splits where the unit is out of $s_{1}$, where $T_{i}=T$ if $i\notin s$. Let

$$\mathring{\mu}(x_{i},s)=\frac{1}{T_{i}}\sum_{t=1}^{T}\mu(x_{i},s_{1}^{(t)})\mathbb{I}(i\notin s_{1}^{(t)})\tag{12}$$

for any $i\in U$, which has the same expectation as the exact SRB-probability. It follows that, given a representative training $pq$-design ([2](https://arxiv.org/html/2606.28795#S2.E2)), we have

$$E_{pq}\big(\mathring{\mu}(x_{i},s)\mid i\in s_{2}\big)=E_{p}\big(\bar{\mu}(x_{i},s)\mid i\notin s_{1}\big)=E_{pq}\big(\mathring{\mu}(x_{i},s)\mid i\notin s\big)\tag{13}$$

for any $i\in U$. Let $\mathring{y}_{i}(s)$ be the classifier using the probability $\mathring{\mu}(x_{i},s)$ by ([12](https://arxiv.org/html/2606.28795#S4.E12)). Let its *out-of-sample classification accuracy* be given similarly as ([11](https://arxiv.org/html/2606.28795#S4.E11)),

$$\mathring{\phi}(R)=\frac{1}{n_{R}}\mathring{\Phi}(R)\quad\text{and}\quad\mathring{\Phi}(R)=\sum_{i\in R}\mathbb{I}\big(y_{i}=\mathring{y}_{i}(s)\big)\tag{14}$$
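The Monte Carlo OOB average in (12) can be sketched as follows for the in-sample units. This is an illustration only: it assumes a one-dimensional feature, SRS of $s_{1}$ from $s$, and a plain kNN-mean as $\mu(\cdot)$; the names `knn_mean` and `oob_srb_probability` are hypothetical.

```python
import random


def knn_mean(x0, donors, k=5):
    """Mean of y over the k donors (x, y) nearest to x0 in absolute distance."""
    nearest = sorted(donors, key=lambda d: abs(d[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def oob_srb_probability(sample, n1, T, rng, k=5):
    """Monte Carlo OOB average as in (12) for each in-sample unit.

    sample: list of (x, y) pairs making up s.
    Returns one value per unit, or None if the unit was never outside s1.
    """
    n = len(sample)
    sums = [0.0] * n
    counts = [0] * n  # T_i: number of splits with unit i outside s1
    for _ in range(T):
        s1_idx = set(rng.sample(range(n), n1))  # SRS of s1 from s
        donors = [sample[j] for j in s1_idx]
        for i in range(n):
            if i not in s1_idx:  # only out-of-bag predictions are used
                sums[i] += knn_mean(sample[i][0], donors, k)
                counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

For an out-of-sample unit every split is out-of-bag, so $T_{i}=T$ and the same average runs over all $T$ training sets.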
### 4\.2OOB estimation of classification accuracy
Let the unit-specific classification accuracy using the probability $\mathring{\mu}(x_{i},s)$ be

$$\mathring{\alpha}_{i}=E_{pq}\Big(\mathbb{I}\big(y_{i}=\mathring{y}_{i}(s)\big)\mid i\notin s_{1}\Big)$$

for any $i\in U$. In other words, for given unit $i$ with fixed $y_{i}$, $\mathring{\alpha}_{i}$ measures the performance of $\mathring{y}_{i}(s)$ over repeated sampling conditional on $i\notin s_{1}$. Next, let us introduce a regularity condition on $\mathring{\alpha}_{i}$ as below.

- (M) Given representative training $pq$-design, $\mathring{\alpha}_{i}$ is a function given as $\mathring{\alpha}_{i}=m_{i}\Big(y_{i},x_{i},E_{pq}\big(\mathring{\mu}(x_{i},s)\mid i\notin s_{1}\big)\Big)$

Note that the function $m_{i}(\cdot)$ is allowed to vary across the units. The condition (M) effectively requires $\mathring{\alpha}_{i}$ to be a constant given the arguments of $m_{i}(\cdot)$.
By ([13](https://arxiv.org/html/2606.28795#S4.E13)), we can exchange $E_{pq}\big(\mathring{\mu}(x_{i},s)\mid i\notin s\big)$ and $E_{pq}\big(\mathring{\mu}(x_{i},s)\mid i\in s_{2}\big)$, such that the mapping $m_{i}(\cdot)$ of the condition (M) yields

$$\mathring{\alpha}_{i}=E_{pq}\Big(\mathbb{I}\big(y_{i}=\mathring{y}_{i}(s)\big)\mid i\in s_{2}\Big)=E_{pq}\Big(\mathbb{I}\big(y_{i}=\mathring{y}_{i}(s)\big)\mid i\notin s\Big).\tag{15}$$

Notice that the condition (M) is not trivial. As the Lemma below shows, although each $\mathbb{I}\big(y_{i}=\mathring{y}_{i}(s)\big)$ is a function of $\mathring{\mu}(x_{i},s)$, apart from $(y_{i},x_{i})$ obviously, it does not follow that $\mathring{\alpha}_{i}$ depends only on $E_{pq}\big(\mathring{\mu}(x_{i},s)\mid i\notin s_{1}\big)$.
###### Lemma 2\.
The random classifier $\mathring{y}_{i}(s)=\mathbb{I}\big(u_{i}\leq\mathring{\mu}(x_{i},s)\big)$, where $u_{i}\overset{\text{IID}}{\sim}\text{Unif}(0,1)$, satisfies the condition (M). The deterministic classifier $\mathring{y}_{i}(s)=\mathbb{I}\big(\mathring{\mu}(x_{i},s)\geq\psi\big)$ does not satisfy the condition (M) given any threshold value $\psi\in(0,1)$.
###### Proposition 3\.
Given a representative training $pq$-design and a classifier satisfying the condition (M), a $pq$-unbiased predictor of the out-of-sample classification accuracy $\mathring{\Phi}(R)$ in ([14](https://arxiv.org/html/2606.28795#S4.E14)) is given as

$$\mathring{\Phi}_{w}(s)=\sum_{i\in s}\big(\tfrac{1}{\pi_{i}}-1\big)\mathbb{I}\big(y_{i}=\mathring{y}_{i}(s)\big).\tag{16}$$

Notice that we do not estimate each $\mathring{\alpha}_{i}$ specifically, but rather we work with their aggregates. The result above includes the exact SRB-probability $\bar{\mu}(x_{i},s)$ as a special case when it is feasible instead of $\mathring{\mu}(x_{i},s)$.
Notice also that the use of in-sample out-of-bag $\mathbb{I}\big(y_{i}=\mathring{y}_{i}(s)\big)$ in ([16](https://arxiv.org/html/2606.28795#S4.E16)) means that debiasing $\mu(x_{i},s_{1})$ by tuning needs to be restricted to $s_{1}$, such as kNN based on donor set $s_{11}$ and tuned by $s_{1}\setminus s_{11}$, where $s_{11}\subset s_{1}$. This is because tuning must not invalidate the ‘out-of-bag-ness’ of $\mathring{y}_{i}$, i.e. tuning must be based on the units outside each $s_{2}$, hence inside each $s_{1}$.
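The weighted sum in (16) is simple to evaluate once the in-sample OOB classifications are available. The sketch below is a hedged illustration; the function name and the list-based inputs are assumptions, and `y_hat` must hold the out-of-bag classifications $\mathring{y}_{i}(s)$ rather than in-bag fits.

```python
def accuracy_total_predictor(y, y_hat, pi):
    """Evaluate (16): sum over the sample of (1/pi_i - 1) * I(y_i == y_hat_i).

    y, y_hat: observed and OOB-classified labels of the sampled units.
    pi: sample inclusion probabilities pi_i of the same units.
    """
    return sum((1.0 / p - 1.0) * (yi == yh) for yi, yh, p in zip(y, y_hat, pi))


# Hypothetical SRS example: N = 20, n = 4, so pi_i = 0.2 and n_R = 16
total = accuracy_total_predictor([1, 0, 1, 1], [1, 0, 0, 1], [0.2] * 4)
accuracy = total / 16
```

Under SRS with $\pi_{i}=n/N$, every weight equals $(N-n)/n$, so dividing by $n_{R}=N-n$ reduces the predictor to the in-sample share of correct OOB classifications.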
## 5Illustrative application
We apply prediction tuning and classification accuracy estimation to a real dataset to demonstrate the theory and methods developed in this paper\.
### 5\.1Data and setup
Let $N=10^{4}$ satellite images be randomly selected from the CropHarvest (HDF5) dataset, each assigned $y=1$ in case of a coffee field or $y=0$ otherwise, associated with an $x$-vector of 216 features. We treat these images as the population $U$ with associated constants $\{(y_{i},x_{i}):i=1,...,N\}$ over repeated sampling.
Let kNN with $k=5$ be the ML algorithm, based on any of the three feature sets below:
- I. (x\_126, x\_144, x\_180, x\_190, x\_198) by LASSO logistic regression;
- II. (x\_008, x\_025, x\_048, x\_129, x\_183) by random selection;
- III. (x\_043, x\_162, x\_136, x\_063, x\_024) by forward selection for 5NN.
Consider the following 5NN-based predictors for out-of-sample $y$-classification.
- Sample-5NN $\eta(x,s)$, given sample $s$ of size $n$ selected by SRS.
- SRB-5NN $E_{q}\big(\eta(x,s_{1})\mid s\big)$, with training set $s_{1}$ of size $n_{1}$ by SRS from $s$.
- OOB-tuned SRB-5NN $E_{q}\big(\eta_{\tau}(x,s_{1})\mid s\big)$, where $\eta_{\tau}(x,s_{1})$ is the $s_{11}$-5NN that is tuned by out-of-$s_{11}$ errors in $s_{1}\setminus s_{11}$, given $s_{11}$ of size $n_{11}$ by SRS from $s_{1}$, i.e. $\eta_{\tau}(x,s_{1})=\eta(x,s_{11})-\tfrac{1}{n_{1}-n_{11}}\sum_{i\in s_{1}\setminus s_{11}}\big(\eta(x_{i},s_{11})-y_{i}\big)$
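The tuning step in the last predictor can be sketched as follows. This is a minimal illustration with a one-dimensional feature and hypothetical set sizes (120 and 80); the helper names `knn_mean`, `split_s1` and `tuned_knn` are not from the paper.

```python
import random


def knn_mean(x0, donors, k=5):
    """Mean of y over the k donors (x, y) nearest to x0 in absolute distance."""
    nearest = sorted(donors, key=lambda d: abs(d[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def split_s1(s1, n11, rng):
    """Draw s11 of size n11 by SRS from s1; return (s11, s1 without s11)."""
    chosen = set(rng.sample(range(len(s1)), n11))
    s11 = [u for j, u in enumerate(s1) if j in chosen]
    rest = [u for j, u in enumerate(s1) if j not in chosen]
    return s11, rest


def tuned_knn(x0, s11, rest, k=5):
    """s11-kNN at x0 minus the mean out-of-s11 error observed on s1 without s11."""
    shift = sum(knn_mean(x, s11, k) - y for x, y in rest) / len(rest)
    return knn_mean(x0, s11, k) - shift


rng = random.Random(7)
s1 = [(rng.random(), int(rng.random() < 0.3)) for _ in range(120)]
s11, rest = split_s1(s1, 80, rng)
tuned_on_rest = [tuned_knn(x, s11, rest) for x, _ in rest]
```

By construction the tuned predictions average to the observed mean of $y$ over $s_{1}\setminus s_{11}$, and individual tuned values may fall outside $[0,1]$ because the shift is additive.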
We use simulations to evaluate the performance of design-based debiasing and classification accuracy prediction. Let $n=200$ be the size of sample $s$ by SRS from $U$, let $n_{1}=120$ be the size of $s_{1}$ for SRB given by SRS from $s$, and let $n_{11}=80$ be the size of $s_{11}$ for the OOB-tuned SRB given by SRS from $s_{1}$. Let $T=100$ for Monte Carlo SRB, such that, for any $k\in U$, the SRB-5NN predictor is given by ([12](https://arxiv.org/html/2606.28795#S4.E12)) generically, and the OOB-tuned SRB-5NN predictor is given by

$$\mathring{\eta}_{\tau}(x_{k},s)=\frac{1}{T_{k}}\sum_{t=1}^{T}\mathbb{I}(k\notin s_{1}^{(t)})\Big\{\eta(x_{k},s_{11}^{(t)})-\tfrac{1}{n_{1}-n_{11}}\sum_{i\in s_{1}^{(t)}\setminus s_{11}^{(t)}}\big(\eta(x_{i},s_{11}^{(t)})-y_{i}\big)\Big\}.$$

Finally, for the classification accuracy ([14](https://arxiv.org/html/2606.28795#S4.E14)) by randomised classifier associated with a given OOB SRB-5NN probability, denoted by $\mathring{\eta}(x_{i},s)$, we can replace the random variable $\mathbb{I}\big(y_{i}=\mathring{y}_{i}(s)\big)$ in ([16](https://arxiv.org/html/2606.28795#S4.E16)) by

$$p_{i}=\mathbb{I}(y_{i}=1)\mathring{\eta}(x_{i},s)+\mathbb{I}(y_{i}=0)\big(1-\mathring{\eta}(x_{i},s)\big).$$

This is more efficient than using the random $\mathring{y}_{i}(s)=\mathbb{I}\big(u_{i}\leq\mathring{\eta}(x_{i},s)\big)$, which is subject to an additional variance due to $u_{i}\overset{\text{IID}}{\sim}\text{Unif}(0,1)$.
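To illustrate why $p_{i}$ can stand in for the random indicator, the sketch below compares $p_{i}$ with the simulated share of correct randomised classifications for a single unit. It assumes a probability already within $[0,1]$; the function names and the seed are illustrative only.

```python
import random


def p_correct(y, eta):
    """p_i: probability that I(u <= eta) equals y for u ~ Unif(0, 1), 0 <= eta <= 1."""
    return eta if y == 1 else 1.0 - eta


def simulated_correct(y, eta, draws, rng):
    """Share of correct randomised classifications over repeated draws of u."""
    hits = sum((1 if rng.random() <= eta else 0) == y for _ in range(draws))
    return hits / draws


exact = p_correct(0, 0.8)
approx = simulated_correct(0, 0.8, 20_000, random.Random(3))
```

The simulated share fluctuates around the exact value, which is the extra variance that using $p_{i}$ avoids.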
### 5\.2Results
Denote by $B$ the number of simulations, each corresponding to a different sample $s$ from $U$. Let $\hat{P}^{(b)}$ be a realised prediction estimate of the population mean $P$, as defined in Section [3](https://arxiv.org/html/2606.28795#S3), where $b=1,...,B$. Let

$$\begin{aligned}\text{EBias}&=\frac{1}{B}\sum_{b=1}^{B}(\hat{P}^{(b)}-P)\quad\text{with}\quad\text{V(EBias)}=\frac{1}{B(B-1)}\sum_{b=1}^{B}(\hat{P}^{(b)}-P)^{2}\\ \text{and}\quad\text{EVar}&=\frac{1}{B-1}\sum_{b=1}^{B}\Big\{\hat{P}^{(b)}-\tfrac{1}{B}\sum_{b=1}^{B}\hat{P}^{(b)}\Big\}^{2}.\end{aligned}$$

Table 2: Debiasing results, feature set (I)–(III), $B=100$ simulations

Table [2](https://arxiv.org/html/2606.28795#S5.T2) shows the results for debiasing, based on $B=100$ simulations. Firstly, given any feature set, the OOB-tuned SRB-5NN is essentially unbiased for the population mean: its EBias is clearly closer to 0 than that of the untuned SRB-5NN and not significantly different from 0 in light of V(EBias). This confirms again the theory of design-based debiasing developed in Section [2](https://arxiv.org/html/2606.28795#S2).
Secondly, the bias of kNN prediction depends on the selected features and their sample distribution\. In this case, the randomly selected feature set \(II\) causes very little bias to the standard sample\-5NN, which is possible if it leads to ‘random’ donor imputation under SRS\. However, such ‘accidental’ unbiasedness can come at a price of increased variance, which is evident if one compares the EVar of sample\-5NN given either feature set \(II\) or \(III\)\. We notice that the results of feature set \(II\) are included here only for instructive purposes, since one is unlikely to use randomly selected features for kNN in practice\.
Thirdly, feature selection is an integral part of algorithmic ML. One can see this most clearly in the poor results given feature set (I). Although it may be a reasonable choice for logistic regression, using these features for kNN can increase both the prediction bias and variance. In this case, the EBias of sample-5NN (or SRB-5NN) is larger than $\sqrt{\text{EVar}}$ and the EVar is even larger than that given the randomly selected feature set (II).
Finally, given feature set (III) for kNN, the EBias of directly applying 5NN is small but still appreciable compared to $\sqrt{\text{EVar}}$. Since the empirical MSE of the OOB-tuned SRB-5NN is smaller than that of either untuned 5NN predictor, debiasing is worthwhile for out-of-sample prediction here.
Table 3: Mean over simulations ($B=100$) of classification accuracy $\mathring{\phi}(R)$, its predictor $\hat{\phi}(R)$, and associated squared error $\{\hat{\phi}(R)-\mathring{\phi}(R)\}^{2}$

When it comes to the classification accuracy $\mathring{\phi}(R)$ associated with a given SRB-5NN predictor, as defined by ([14](https://arxiv.org/html/2606.28795#S4.E14)), let $\hat{\phi}(R)=n_{R}^{-1}\mathring{\Phi}_{w}(s)$ according to ([16](https://arxiv.org/html/2606.28795#S4.E16)). Table [3](https://arxiv.org/html/2606.28795#S5.T3) gives the results based on $B=100$ simulations, where the EBias of classification accuracy prediction is the difference between columns $\mathring{\phi}(R)$ and $\hat{\phi}(R)$, and the last column shows its empirical MSE over the simulations. It is evident that Proposition [3](https://arxiv.org/html/2606.28795#Thmproposition3) achieves unbiased prediction of the out-of-sample classification accuracy in every situation.
Notice that, given either feature set (I) or (II), the classification accuracy of the unbiased OOB-tuned SRB-5NN is clearly reduced compared to the biased SRB-5NN. The reason is that tuning can cause the predicted probabilities to be out of the bounds $[0,1]$, which then leads to truncation of probabilities and loss of classification accuracy, as illustrated by the example below.
###### Example 3\.
Suppose two out-of-sample units with $(y_{1},y_{2})=(1,0)$. Suppose a given predictor yields $(\mu_{1},\mu_{2})=(0.95,0.07)$, such that the probability of correct randomised classification is $\tfrac{1}{2}(0.95+0.93)=0.94$. Given $\mu_{1}+\mu_{2}=1.02\neq y_{1}+y_{2}$, tuning is possible, which yields, say, $(\mathring{\mu}_{1},\mathring{\mu}_{2})=(\mu_{1},\mu_{2})+\epsilon$. As long as $(\mathring{\mu}_{1},\mathring{\mu}_{2})$ are within the bounds $[0,1]$, the probability of correct randomised classification is unaffected, since $\tfrac{1}{2}(\mu_{1}+\epsilon+1-\mu_{2}-\epsilon)=\tfrac{1}{2}(\mu_{1}+1-\mu_{2})$. However, it is possible to obtain out-of-bound probabilities, say, $(\mathring{\mu}_{1},\mathring{\mu}_{2})=(1.04,0.16)$, in which case the probability of correct randomised classification would be reduced to $\tfrac{1}{2}(1+0.84)=0.92$.
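The arithmetic of Example 3 can be reproduced with a few lines of Python. The sketch assumes that out-of-bound probabilities are truncated to $[0,1]$ before randomised classification, as described above; the helper names and the in-bounds shift of 0.03 are illustrative.

```python
def clip01(p):
    """Truncate a tuned probability to the interval [0, 1]."""
    return min(1.0, max(0.0, p))


def mean_correct_probability(ys, mus):
    """Average probability of correct randomised classification after truncation."""
    probs = [clip01(m) if y == 1 else 1.0 - clip01(m) for y, m in zip(ys, mus)]
    return sum(probs) / len(probs)


ys = (1, 0)
untuned = mean_correct_probability(ys, (0.95, 0.07))
shifted = mean_correct_probability(ys, (0.98, 0.10))  # common shift, still in bounds
clipped = mean_correct_probability(ys, (1.04, 0.16))  # first value is truncated to 1
```

Up to floating-point rounding, `untuned` and `shifted` both equal 0.94, while `clipped` drops to 0.92 because truncation breaks the cancellation of the common shift.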
Meanwhile, given the feature set \(III\) selected for kNN, the drop of classification accuracy by debiasing is much less pronounced, i\.e\. 0\.987 compared to 0\.989, and the classification accuracy is higher compared to using the feature set \(I\) or \(II\)\. In other words, not only does sensible feature selection improve the classification accuracy, it can also reduce the risk of out\-of\-bounds tuning for the purpose of unbiased population prediction estimation\.
## 6Summary of findings and future research
ML must use a training dataset to form the prediction model or algorithm\. It would be difficult to relate its in\-sample out\-of\-bag prediction error of a given unit to its out\-of\-sample prediction error of the same unit, if the*expected*prediction would differ depending on whether the unit is in or out of the sample\. The concept of representative training articulates this intuition in terms of how the training set can be selected from the relevant finite population, whether it consists of persons, businesses, physical or spatial objects\.
Notice that the same intuition has always existed in IID\-model inference, such as cross\-validation, or out\-of\-bag accuracy for random forests\. Since one can obtain IID samples by with\-replacement sampling from a given population, the concept of representative training can be viewed as a generalisation of these IID\-model inference ideas to finite population inference, where the units may be selected with unequal probabilities and without replacement\.
Moreover, representative training generally provides a means for tuning a trained algorithm, based on its in-sample out-of-bag prediction errors, so as to achieve design-unbiased out-of-sample prediction, i.e. over repeated sampling-and-training under the $pq$-design. A topic for future study may be to investigate representative training for interval estimation so as to achieve exact design-based coverage as defined by Neyman (1934).
In addition, we have shown that out-of-sample unit-level classification of categorical outcomes can yield unbiased prediction estimation of population totals, and the associated classification accuracy can be assessed unbiasedly given representative training by the $pq$-design.
However, one should keep in mind the distinct needs of population-level estimation and unit-level classification. For instance, in many applications of ML, such as official statistics, population estimation is always required, whereas unit classification is required only sometimes. If unbiasedness of population estimation is necessary, then one may have to accept a reduced classification accuracy. A natural topic for future research is therefore how to implement unit-level classification subject to unbiased population prediction estimation, such that the classification accuracy can be maximised.
Finally, looking back on the groundbreaking contribution of Neyman (1934) to finite population sampling, which consists of a “representative” method of sampling and an associated “consistent” method of estimation, one may view representative training $pq$-designs as an extension of the concept of “representative” sampling, providing an intuitive basis for broadening the scope of associated “consistent” ML estimation by predicting or classifying the out-of-sample units. We hope this general outlook on design-unbiased finite population inference may be useful to many practitioners of ML.
## Appendix AProofs
Lemma[1](https://arxiv.org/html/2606.28795#Thmlemma1)\.
###### Proof\.
For the proof here, write $\pi_{2i}(s_{1})$ for $\pi_{2i}$ to emphasise it as a function of $s_{1}$. Given a well-defined $pq$-design, we have $E_{pq}(\cdot)=E_{s_{1}s_{2}}(\cdot)$, such that

$$\begin{aligned}\text{(2)}\quad&\Leftrightarrow\quad\sum_{s_{1}\not\ni i}\mu(x_{i},s_{1})\frac{f(s_{1})\Pr(i\in s_{2}\mid s_{1})}{\Pr(i\in s_{2})}=\sum_{s_{1}\not\ni i}\mu(x_{i},s_{1})\frac{f(s_{1})\Pr(i\notin s_{2}\mid s_{1})}{\Pr(i\notin s)}\\ &\Leftrightarrow\quad\sum_{s_{1}\not\ni i}\mu(x_{i},s_{1})f(s_{1})\frac{\pi_{2i}(s_{1})}{\pi_{i}-\Pr(i\in s_{1})}=\sum_{s_{1}\not\ni i}\mu(x_{i},s_{1})f(s_{1})\frac{1-\pi_{2i}(s_{1})}{1-\pi_{i}}\\ &\Leftrightarrow\quad\sum_{s_{1}\not\ni i}\mu(x_{i},s_{1})f(s_{1})\pi_{2i}(s_{1})\frac{1-\Pr(i\in s_{1})}{\pi_{i}-\Pr(i\in s_{1})}=\sum_{s_{1}\not\ni i}\mu(x_{i},s_{1})f(s_{1}).\end{aligned}\tag{17}$$

First, the equality ([17](https://arxiv.org/html/2606.28795#A1.E17)) holds for *all possible* $\mu(x_{i},s_{1})$ if and only if

$$\pi_{2i}(s_{1})\frac{1-\Pr(i\in s_{1})}{\pi_{i}-\Pr(i\in s_{1})}=1\quad\Leftrightarrow\quad\text{(3)}$$

Next, multiplying both sides of ([17](https://arxiv.org/html/2606.28795#A1.E17)) by $\tfrac{\pi_{i}-\Pr(i\in s_{1})}{\{1-\Pr(i\in s_{1})\}^{2}}$, the left-hand side yields

$$\sum_{s_{1}\not\ni i}\mu(x_{i},s_{1})\pi_{2i}(s_{1})\frac{f(s_{1})}{1-\Pr(i\in s_{1})}=E_{s_{1}}\big(\mu(x_{i},s_{1})\pi_{2i}(s_{1})\mid i\notin s_{1}\big)$$

while the right-hand side can be written as

$$\begin{aligned}&\Big(\sum_{s_{1}\not\ni i}\mu(x_{i},s_{1})\tfrac{f(s_{1})}{1-\Pr(i\in s_{1})}\Big)\Big(\tfrac{\Pr(i\in s_{2})}{1-\Pr(i\in s_{1})}\Big)=E_{s_{1}}\big(\mu(x_{i},s_{1})\mid i\notin s_{1}\big)\Big(\sum_{s_{1}\not\ni i}\pi_{2i}(s_{1})\tfrac{f(s_{1})}{1-\Pr(i\in s_{1})}\Big)\\ &\qquad=E_{s_{1}}\big(\mu(x_{i},s_{1})\mid i\notin s_{1}\big)E_{s_{1}}\big(\pi_{2i}(s_{1})\mid i\notin s_{1}\big)\end{aligned}$$

i.e. representative training of any given $\mu(x,s_{1})$ if and only if ([4](https://arxiv.org/html/2606.28795#S2.E4)), including when $\pi_{2i}(s_{1})$ varies with $s_{1}$ and is not given by ([3](https://arxiv.org/html/2606.28795#S2.E3)). ∎
Corollary[1](https://arxiv.org/html/2606.28795#Thmcoro1)\.
###### Proof\.
Given sample sizes $n$ of $s$ and $n_{1}$ of $s_{1}$, we have $\mathcal{C}(N-1,n_{1})=\frac{(N-1)!}{n_{1}!(N-n_{1}-1)!}$ distinct $s_{1}$ on either side of ([2](https://arxiv.org/html/2606.28795#S2.E2)), and $\Pr(i\in s_{2}\mid s_{1})=\tfrac{n-n_{1}}{N-n_{1}}\mathbb{I}(i\notin s_{1})$. ∎
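The SRS-SRS case can also be checked by brute force on a toy population. The sketch below enumerates every sample $s$ and training set $s_{1}$ and compares the two conditional expectations exactly, using rational arithmetic. The one-nearest-neighbour rule on unit indices is only a stand-in for $\mu(x_{i},s_{1})$, and the population values and function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def one_nn(unit, s1, y):
    """y-value of the donor in s1 whose index is closest to unit (ties: smaller index)."""
    donor = min(s1, key=lambda j: (abs(j - unit), j))
    return Fraction(y[donor])


def conditional_means(y, n, n1, unit):
    """Exact E[mu | unit in s2] and E[mu | unit not in s] under SRS-SRS.

    All (s, s1) pairs are equally likely, so each conditional expectation
    is a plain average over the pairs that satisfy the condition.
    """
    in_s2, out_of_s = [], []
    for s in combinations(range(len(y)), n):
        for s1 in combinations(s, n1):
            mu = one_nn(unit, s1, y)
            if unit not in s:
                out_of_s.append(mu)
            elif unit not in s1:
                in_s2.append(mu)
    average = lambda v: sum(v, Fraction(0)) / len(v)
    return average(in_s2), average(out_of_s)


lhs, rhs = conditional_means(y=[0, 1, 1, 0, 1, 0], n=4, n1=2, unit=2)
```

The two values agree exactly because, under either condition, $s_{1}$ is uniformly distributed over the subsets of size $n_{1}$ that exclude the unit.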
Corollary[2](https://arxiv.org/html/2606.28795#Thmcoro2)\.
###### Proof\.
Let $\pi_{i}$ be the Poisson sampling probability of $i\in U$, and $\lambda$ that of Bernoulli sampling from $s$. We have $f(s_{1})=\prod_{j\in s_{1}}\lambda\pi_{j}\prod_{l\notin s_{1}}(1-\lambda\pi_{l})$ and $\Pr(i\in s_{1})=\lambda\pi_{i}$. Moreover, we have $\Pr(i\in s_{2}\mid s_{1})=\tfrac{\pi_{i}-\lambda\pi_{i}}{1-\lambda\pi_{i}}\mathbb{I}(i\notin s_{1})$ for each $i\in U$. ∎
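As a short worked substitution (an added check, not part of the original proof), these quantities can be plugged into the condition displayed in the proof of Lemma 1. For any $s_{1}$ not containing $i$, and assuming $0<\lambda<1$ and $\pi_{i}>0$ so that no denominator vanishes,

$$\pi_{2i}(s_{1})\frac{1-\Pr(i\in s_{1})}{\pi_{i}-\Pr(i\in s_{1})}=\frac{\pi_{i}-\lambda\pi_{i}}{1-\lambda\pi_{i}}\cdot\frac{1-\lambda\pi_{i}}{\pi_{i}-\lambda\pi_{i}}=1,$$

so the condition holds for every such $s_{1}$, whatever the values of $\pi_{i}$ across the units.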
Proposition[1](https://arxiv.org/html/2606.28795#Thmproposition1)\.
###### Proof\.
Regarding the right-hand sides in ([6](https://arxiv.org/html/2606.28795#S2.E6)) and ([5](https://arxiv.org/html/2606.28795#S2.E5)), we have

$$E_{p}\Big(\sum_{i\in s}\big(\tfrac{1}{\pi_{i}}-1\big)y_{i}\Big)=\sum_{i\in U}y_{i}-E_{p}\big(\sum_{i\in s}y_{i}\big)=E_{p}\big(\sum_{i\in R}y_{i}\big).$$

Meanwhile, regarding the left-hand sides in ([6](https://arxiv.org/html/2606.28795#S2.E6)) and ([5](https://arxiv.org/html/2606.28795#S2.E5)), we have

$$\begin{aligned}E_{pq}\Big(\sum_{i\in s}\big(\tfrac{1}{\pi_{i}}-1\big)\mu(x_{i},s_{1})\mid i\in s_{2}\Big)&=E_{pq}\Big(\sum_{i\in U}\tfrac{\mathbb{I}(i\in s)}{\pi_{i}}(1-\pi_{i})\mu(x_{i},s_{1})\mid i\in s_{2}\Big)\\ &=\sum_{i\in U}E_{pq}\Big((1-\pi_{i})\mu(x_{i},s_{1})\mid i\in s_{2},i\in s\Big)\\ &=\sum_{i\in U}(1-\pi_{i})E_{pq}\big(\mu(x_{i},s_{1})\mid i\in s_{2}\big)\\ &\overset{(2)}{=}\sum_{i\in U}(1-\pi_{i})E_{pq}\big(\mu(x_{i},s_{1})\mid i\notin s\big)\\ &=E_{pq}\big(\sum_{i\in R}\mu(x_{i},s_{1})\big)\end{aligned}$$

given representative training $pq$-design. Thus, the condition ([6](https://arxiv.org/html/2606.28795#S2.E6)) ensures ([5](https://arxiv.org/html/2606.28795#S2.E5)). ∎
Proposition[2](https://arxiv.org/html/2606.28795#Thmproposition2)\.
###### Proof\.
Since the total out-of-sample prediction bias of $\bar{\mu}(x_{i},s)$ is given by the expectation of the difference between the two sides of ([6](https://arxiv.org/html/2606.28795#S2.E6)), one can apportion $1/n_{R}$ of this difference to each $i\notin s$ as a $p$-unbiased correction. The expression ([9](https://arxiv.org/html/2606.28795#S2.E9)) follows, since

$$E_{q}\Big(\sum_{i\in s_{2}}\tfrac{1}{q_{2i}}\big(\tfrac{1}{\pi_{i}}-1\big)\mu(x_{i},s_{1})\Big)=\sum_{i\in s}\big(\tfrac{1}{\pi_{i}}-1\big)E_{q}\big(\mu(x_{i},s_{1})\mid i\in s_{2}\big).$$

∎
Lemma[2](https://arxiv.org/html/2606.28795#Thmlemma2)\.
###### Proof\.
For the random classifier, we have

$$
\begin{aligned}
\mathbb{I}\big(y_i=1=\mathring{y}_i(s)\big)&=\begin{cases}1&\text{if}~u_i\leq\mathring{\mu}(x_i,s)\\ 0&\text{if}~u_i>\mathring{\mu}(x_i,s)\end{cases}~\Rightarrow~\mathring{\alpha}_i=E_{pq}\big(\mathring{\mu}(x_i,s)\mid i\notin s_1\big)\text{ if }y_i=1\\
\mathbb{I}\big(y_i=0=\mathring{y}_i(s)\big)&=\begin{cases}1&\text{if}~u_i>\mathring{\mu}(x_i,s)\\ 0&\text{if}~u_i\leq\mathring{\mu}(x_i,s)\end{cases}~\Rightarrow~\mathring{\alpha}_i=1-E_{pq}\big(\mathring{\mu}(x_i,s)\mid i\notin s_1\big)\text{ if }y_i=0
\end{aligned}
$$

such that $\mathring{\alpha}_i$ is fully determined given $y_i$, $x_i$ and $E_{pq}\big(\mathring{\mu}(x_i,s)\mid i\notin s_1\big)$.

Whereas, for the deterministic classifier, we have

$$
\begin{aligned}
\mathbb{I}\big(y_i=1=\mathring{y}_i(s)\big)&=\begin{cases}1&\text{if}~\mathring{\mu}(x_i,s)\geq\psi\\ 0&\text{otherwise}\end{cases}~\Rightarrow~\mathring{\alpha}_i=\Pr\big(\mathring{\mu}(x_i,s)\geq\psi\mid i\notin s_1\big)\text{ if }y_i=1\\
\mathbb{I}\big(y_i=0=\mathring{y}_i(s)\big)&=\begin{cases}1&\text{if}~\mathring{\mu}(x_i,s)<\psi\\ 0&\text{otherwise}\end{cases}~\Rightarrow~\mathring{\alpha}_i=\Pr\big(\mathring{\mu}(x_i,s)<\psi\mid i\notin s_1\big)\text{ if }y_i=0
\end{aligned}
$$

such that $\mathring{\alpha}_i$ depends on the variance of $\mathring{\mu}(x_i,s)$ as well as $E_{pq}\big(\mathring{\mu}(x_i,s)\mid i\notin s_1\big)$. ∎
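The contrast drawn in this proof can be illustrated with a small exact calculation. The sketch below is only an illustration under stated assumptions: it takes $u_i$ to be uniform on $(0,1)$ and independent of the sample (not restated in this proof), and it uses two hypothetical discrete distributions for $\mathring{\mu}(x_i,s)$ that share the same mean but differ in spread. The threshold `psi` and all numerical values are made up for the example.

```python
from fractions import Fraction as F

def alpha_random(dist):
    """P(y_hat = 1) for the random classifier.

    Assuming u ~ Uniform(0, 1), P(u <= mu) = mu, so the probability is E[mu].
    """
    return sum(p * mu for mu, p in dist)

def alpha_deterministic(dist, psi):
    """P(y_hat = 1) for the deterministic classifier: P(mu >= psi)."""
    return sum(p for mu, p in dist if mu >= psi)

# Hypothetical distributions of mu(x_i, s) over samples, as (value, probability).
narrow = [(F(2, 5), F(1, 2)), (F(3, 5), F(1, 2))]    # mean 1/2, small spread
wide = [(F(1, 10), F(1, 2)), (F(9, 10), F(1, 2))]    # mean 1/2, large spread
psi = F(7, 10)                                        # hypothetical threshold

print(alpha_random(narrow), alpha_random(wide))                      # same value
print(alpha_deterministic(narrow, psi), alpha_deterministic(wide, psi))  # differ
```

For a unit with $y_i=1$, the random classifier gives the same $\mathring{\alpha}_i$ under both distributions because only the mean matters, whereas the deterministic classifier gives different values because the spread decides how often the threshold is reached.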
Proposition [3](https://arxiv.org/html/2606.28795#Thmproposition3).
###### Proof\.
We have

$$
\begin{aligned}
E_{pq}\big(\mathring{\Phi}(R)\big)&=\sum_{i\in U}E_{pq}\Big(\mathbb{I}(i\notin s)\,\mathbb{I}\big(y_i=\mathring{y}_i(s)\big)\Big)\\
&=\sum_{i\in U}(1-\pi_i)E_{pq}\Big(\mathbb{I}\big(y_i=\mathring{y}_i(s)\big)\mid i\notin s\Big)\overset{(15)}{=}\sum_{i\in U}(1-\pi_i)\mathring{\alpha}_i\\
E_{pq}\big(\mathring{\Phi}_w(s)\big)&=\sum_{i\in U}E_{pq}\big(\mathbb{I}(i\in s)\big)\big(\tfrac{1}{\pi_i}-1\big)E_{pq}\Big(\mathbb{I}\big(y_i=\mathring{y}_i(s)\big)\mid i\in s_2,i\in s\Big)\\
&=\sum_{i\in U}(1-\pi_i)E_{pq}\Big(\mathbb{I}\big(y_i=\mathring{y}_i(s)\big)\mid i\in s_2\Big)\overset{(15)}{=}\sum_{i\in U}(1-\pi_i)\mathring{\alpha}_i.
\end{aligned}
$$

Both final equalities use ([15](https://arxiv.org/html/2606.28795#S4.E15)). ∎
## Appendix B Details of Example [2](https://arxiv.org/html/2606.28795#Thmexample2)
Let $Y_R=(N-n)\bar{Y}_R$. Provided $n\geq 3$, we have $\eta(x_i,s)=0$ for any $i\neq 1,2$, and

$$
\hat{Y}_R(s)-Y_R\overset{n\geq 3}{=}\begin{cases}
-1&\text{if $1\notin s$, $\sum_{i\notin s}\eta(x_i,s)=0$, prob. $\frac{N-n}{N}$}\\
0&\text{if $1\in s,2\in s$, $\sum_{i\notin s}\eta(x_i,s)=0$, prob. $\frac{n(n-1)}{N(N-1)}$}\\
+1&\text{if $1\in s,2\notin s$, $\sum_{i\notin s}\eta(x_i,s)=1$, prob. $\frac{N-n}{N}\frac{n}{N-1}$}.
\end{cases}
$$

Thus, using $\eta(x,s)$ is $pq$-biased for estimating $Y_R$ except when $n=N-1$.
Meanwhile, let $\hat{Y}_R(s_1)$ be the prediction estimator of $Y_R$ using $\eta(x_i,s_1)$ for any $i\notin s$. Provided $n_1\geq 3$, we have $\eta(x_i,s_1)=0$ for any $i\neq 1,2$, and

$$
\hat{Y}_R(s_1)-Y_R\overset{n_1\geq 3}{=}\begin{cases}
-1&\text{if $1\notin s$, $\sum_{i\notin s}\eta(x_i,s_1)=0$, prob. $\frac{N-n}{N}$}\\
0&\text{if $1\in s,2\in s$, $\sum_{i\notin s}\eta(x_i,s_1)=0$, prob. $\frac{n(n-1)}{N(N-1)}$}\\
+1&\text{if $1\in s_1,2\notin s$, $\sum_{i\notin s}\eta(x_i,s_1)=1$, prob. $\frac{N-n}{N}\frac{n_1}{N-1}$}\\
0&\text{if $1\in s_2,2\notin s$, $\sum_{i\notin s}\eta(x_i,s_1)=0$, prob. $\frac{N-n}{N}\frac{n_2}{N-1}$}.
\end{cases}
$$

Thus, using the $\bar{\eta}(x,s)$ predictor also leads to $pq$-biased estimation generally.
For residual-tuning of $\eta(x,s_1)$ by $c_\eta(s_1)=\frac{1}{n}\sum_{i\in s}\eta(x_i,s_1)-\bar{y}_s$, we have

$$
c_\eta(s_1)\overset{n_1\geq 3}{=}\begin{cases}
0&\text{if $1\notin s$, $\sum_{i\in s}\eta(x_i,s_1)=0$, $\sum_{i\in s}y_i=0$, prob. $\frac{N-n}{N}$}\\
\frac{1}{n}&\text{if $1\in s_1,2\in s$, $\sum_{i\in s}\eta(x_i,s_1)=2$, $\sum_{i\in s}y_i=1$, prob. $\frac{n_1(n-1)}{N(N-1)}$}\\
-\frac{1}{n}&\text{if $1\in s_2,2\in s$, $\sum_{i\in s}\eta(x_i,s_1)=0$, $\sum_{i\in s}y_i=1$, prob. $\frac{n_2(n-1)}{N(N-1)}$}\\
0&\text{if $1\in s_1,2\notin s$, $\sum_{i\in s}\eta(x_i,s_1)=1$, $\sum_{i\in s}y_i=1$, prob. $\frac{N-n}{N}\frac{n_1}{N-1}$}\\
-\frac{1}{n}&\text{if $1\in s_2,2\notin s$, $\sum_{i\in s}\eta(x_i,s_1)=0$, $\sum_{i\in s}y_i=1$, prob. $\frac{N-n}{N}\frac{n_2}{N-1}$}
\end{cases}
$$

such that the SRB residual-tuning of $\bar{\eta}(x,s)$ is given by

$$
\bar{c}_\eta(s)\overset{n_1\geq 3}{=}\begin{cases}
0&\text{if $1\notin s$, prob. $\frac{N-n}{N}$}\\
\frac{n_1-n_2}{n^2}&\text{if $1\in s,2\in s$, prob. $\frac{n(n-1)}{N(N-1)}$}\\
-\frac{n_2}{n^2}&\text{if $1\in s,2\notin s$, prob. $\frac{N-n}{N}\frac{n}{N-1}$}.
\end{cases}
$$

Straightforward algebra shows that this does *not* remove the bias,

$$
\frac{1}{N-n}E_{pq}\big(\hat{Y}_R(s_1)-Y_R\big)-E_p\big(\bar{c}_\eta(s)\big)\neq 0.
$$
For out-of-bag tuning by $\tau_\eta(s_1)=\frac{1}{n_2}\sum_{i\in s_2}\big(\eta(x_i,s_1)-y_i\big)$, we have

$$
\tau_\eta(s_1)\overset{n_1\geq 3}{=}\begin{cases}
0&\text{if $1\notin s$, $\sum_{i\in s_2}\eta(x_i,s_1)=0$, $\sum_{i\in s_2}y_i=0$, prob. $\frac{N-n}{N}$}\\
-\frac{1}{n_2}&\text{if $1\in s_2,2\in s$, $\sum_{i\in s_2}\eta(x_i,s_1)=0$, $\sum_{i\in s_2}y_i=1$, prob. $\propto\frac{n_2}{n}$}\\
0&\text{if $1\in s_1,2\in s_1$, $\sum_{i\in s_2}\eta(x_i,s_1)=0$, $\sum_{i\in s_2}y_i=0$, prob. $\propto\frac{n_1(n_1-1)}{n(n-1)}$}\\
\frac{1}{n_2}&\text{if $1\in s_1,2\in s_2$, $\sum_{i\in s_2}\eta(x_i,s_1)=1$, $\sum_{i\in s_2}y_i=0$, prob. $\propto\frac{n_2n_1}{n(n-1)}$}\\
0&\text{if $1\in s_1,2\notin s$, $\sum_{i\in s_2}\eta(x_i,s_1)=0$, $\sum_{i\in s_2}y_i=0$, prob. $\frac{N-n}{N}\frac{n_1}{N-1}$}\\
-\frac{1}{n_2}&\text{if $1\in s_2,2\notin s$, $\sum_{i\in s_2}\eta(x_i,s_1)=0$, $\sum_{i\in s_2}y_i=1$, prob. $\frac{N-n}{N}\frac{n_2}{N-1}$}
\end{cases}
$$

such that the SRB out-of-bag tuning of $\bar{\eta}(x,s)$ is given by

$$
\bar{\tau}_\eta(s)\overset{n_1\geq 3}{=}\begin{cases}
0&\text{if $1\notin s$, prob. $\frac{N-n}{N}$}\\
-\frac{n_2-1}{n(n-1)}&\text{if $1\in s,2\in s$, prob. $\frac{n(n-1)}{N(N-1)}$}\\
-\frac{1}{n}&\text{if $1\in s,2\notin s$, prob. $\frac{N-n}{N}\frac{n}{N-1}$}.
\end{cases}
$$

Straightforward algebra shows that this does remove the bias,

$$
\frac{1}{N-n}E_{pq}\big(\hat{Y}_R(s_1)-Y_R\big)-E_p\big(\bar{\tau}_\eta(s)\big)=0.
$$
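The "straightforward algebra" for both tuning schemes can be checked with exact rational arithmetic. The sketch below only re-evaluates the expectations implied by the case tables of this appendix (values times probabilities, valid for $n_1\geq 3$); it does not simulate the predictor itself. The function name `expectations` and the chosen $(N, n, n_1)$ triples are illustrative, and the closed form $-n_1(N-2)/\{nN(N-1)\}$ for the residual-tuning discrepancy is derived here from those tables rather than quoted from the paper.

```python
from fractions import Fraction as F

def expectations(N, n, n1):
    """Exact expectations implied by the case tables of Appendix B (n1 >= 3)."""
    n2 = n - n1
    p_out = F(N - n, N)                    # P(1 not in s)
    p_both = F(n * (n - 1), N * (N - 1))   # P(1 in s, 2 in s)
    p_one = F(N - n, N) * F(n, N - 1)      # P(1 in s, 2 not in s)
    # Per-unit bias of the prediction estimator using eta(x, s1):
    # error -1 when 1 is out of sample, +1 when 1 in s1 and 2 not in s.
    bias = (-p_out + F(N - n, N) * F(n1, N - 1)) / (N - n)
    # Expected SRB residual-tuning constant and out-of-bag tuning constant.
    e_c = F(n1 - n2, n * n) * p_both - F(n2, n * n) * p_one
    e_tau = -F(n2 - 1, n * (n - 1)) * p_both - F(1, n) * p_one
    return bias, e_c, e_tau

for N, n, n1 in [(10, 6, 3), (50, 20, 12), (200, 40, 30)]:
    bias, e_c, e_tau = expectations(N, n, n1)
    print(N, n, n1, "residual-tuned:", bias - e_c, "OOB-tuned:", bias - e_tau)
```

In each configuration the out-of-bag discrepancy is exactly zero, while the residual-tuning discrepancy is nonzero whenever $N>2$.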
## Appendix C Design-based bias and MSE
The derivation follows the approach of Zhang et al\. \(2025\)\.
#### Bias of SRB\-kNN
The prediction error due to $\eta(x,s_1)$ can be written as

$$
B(s_1)=\frac{1}{N}\sum_{i\notin s}e_i(s_1)\quad\text{and}\quad e_i(s_1)=\eta(x_i,s_1)-y_i.
$$

Given $\pi_{2i}=\frac{n_2}{N-n_1}$ under SRS-SRS, an unbiased predictor based on $s_2$ is

$$
\hat{B}(s_1)=\frac{N-n}{Nn_2}\sum_{i\in s_2}e_i(s_1)\quad\text{where}\quad E_s\big(\hat{B}(s_1)-B(s_1)\mid s_1\big)=0.
$$

For the corresponding error of the SRB-kNN $\bar{\eta}(x,s)$, denoted by $\bar{B}(s)$, we have

$$
\hat{\bar{B}}(s)=\frac{N-n}{Nn_2}E_q\Big(\sum_{i\in s_2}e_i(s_1)\Big)=\frac{N-n}{Nn}\sum_{i\in s}\big(\dot{\eta}(x_i,s)-y_i\big).\tag{18}
$$
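The conditional unbiasedness $E_s\big(\hat{B}(s_1)-B(s_1)\mid s_1\big)=0$ can be verified exactly on a toy population by enumerating every possible test set $s_2$. The sketch below assumes, as in the text, that $s_2$ is a simple random sample of size $n_2$ from the $N-n_1$ units outside $s_1$; the residual values and the function name are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations

def mean_prediction_error_gap(e_rest, n2, N, n1):
    """Return E_s(B_hat(s1) - B(s1) | s1) by enumerating all SRS draws of s2.

    e_rest holds the integer residuals e_i(s1) of every unit outside s1
    (hypothetical values), so len(e_rest) must equal N - n1.
    """
    M = N - n1
    n = n1 + n2
    assert len(e_rest) == M
    total = sum(e_rest)
    gaps = []
    for s2 in combinations(range(M), n2):
        sum_s2 = sum(e_rest[i] for i in s2)
        B = F(total - sum_s2, N)              # (1/N) * sum of e_i over i not in s
        B_hat = F(N - n, N * n2) * sum_s2     # (N - n)/(N n2) * sum of e_i over s2
        gaps.append(B_hat - B)
    return sum(gaps) / len(gaps)

# Toy setting: N = 8, n1 = 3, n2 = 2, residuals in {-1, 0, 1}.
print(mean_prediction_error_gap([1, 0, -1, 1, 0], n2=2, N=8, n1=3))
```

The printed value is zero for any choice of residuals, since the identity only relies on $s_2$ being an equal-probability sample from the units outside $s_1$.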
#### Bias of residual\-tuned SRB\-kNN
The prediction error due to $\eta(x,s_1)$ is

$$
\begin{aligned}
B_c(s_1)&=\frac{1}{N}\sum_{i\notin s}e_i(s_1)+\frac{N-n}{N}\frac{1}{n}\sum_{i\in s}\big(y_i-\eta(x_i,s_1)\big)\\
&=\frac{1}{N}\sum_{i\notin s}e_i(s_1)-\Big(\frac{1}{n}-\frac{1}{N}\Big)\Big(\sum_{i\in s_1}e_i(s_1)+\sum_{i\in s_2}e_i(s_1)\Big)\\
&=\frac{N-n}{N}\bar{e}_{U\setminus s}(s_1)-\Big(\frac{1}{n}-\frac{1}{N}\Big)\Big(n_1\bar{e}_{s_1}(s_1)+n_2\bar{e}_{s_2}(s_1)\Big)
\end{aligned}
$$

where $\bar{e}_D(s_1)$ denotes the average of $e_i(s_1)$ over $i\in D$ given any set $D$. Conditional on $s_1$, an unbiased predictor of $B_c(s_1)$ follows as

$$
\begin{aligned}
\hat{B}_c(s_1)&=\Big(1-\frac{n}{N}\Big)\bar{e}_{s_2}(s_1)-\Big(\frac{1}{n}-\frac{1}{N}\Big)\Big(n_1\bar{e}_{s_1}(s_1)+n_2\bar{e}_{s_2}(s_1)\Big)\\
&=n_1\Big(\frac{1}{n}-\frac{1}{N}\Big)\Big(\bar{e}_{s_2}(s_1)-\bar{e}_{s_1}(s_1)\Big)
\end{aligned}
$$

such that an unbiased estimator of the bias of residual-tuned SRB-kNN is

$$
\hat{\bar{B}}_c(s)=E_q\big(\hat{B}_c(s_1)\mid s\big).\tag{19}
$$
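For readers who want the intermediate step, the simplification of $\hat{B}_c(s_1)$ above uses only $1-\frac{n}{N}=n\big(\frac{1}{n}-\frac{1}{N}\big)$ and $n-n_2=n_1$. Spelled out, as a worked step rather than anything new:

$$
\begin{aligned}
\hat{B}_c(s_1)&=n\Big(\frac{1}{n}-\frac{1}{N}\Big)\bar{e}_{s_2}(s_1)-\Big(\frac{1}{n}-\frac{1}{N}\Big)\Big(n_1\bar{e}_{s_1}(s_1)+n_2\bar{e}_{s_2}(s_1)\Big)\\
&=\Big(\frac{1}{n}-\frac{1}{N}\Big)\Big((n-n_2)\,\bar{e}_{s_2}(s_1)-n_1\bar{e}_{s_1}(s_1)\Big)\\
&=n_1\Big(\frac{1}{n}-\frac{1}{N}\Big)\Big(\bar{e}_{s_2}(s_1)-\bar{e}_{s_1}(s_1)\Big).
\end{aligned}
$$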
#### MSE of SRB\-kNN
The error $B(s_1)$ has been given earlier. Conditional on $s_1$, an unbiased estimator of the conditional MSE of $\hat{P}(s_1)$ can be given as

$$
\mbox{mse}(s_1)=\hat{B}(s_1)^2-\hat{V}_s\{\hat{B}(s_1)\mid s_1\}+\hat{V}_s\{B(s_1)\mid s_1\}
$$

(Theorem 1, Zhang et al., 2025), where

$$
\begin{aligned}
\hat{V}_s\{\hat{B}(s_1)\mid s_1\}&=\Big(1-\frac{n}{N}\Big)^2v_2\\
v_2&=\hat{V}_s\big(\bar{e}_{s_2}(s_1)\mid s_1\big)=\Big(\frac{1}{n_2}-\frac{1}{N-n_1}\Big)\frac{1}{n_2-1}\sum_{i\in s_2}\{e_i(s_1)-\bar{e}_{s_2}(s_1)\}^2\\
\hat{V}_s\{B(s_1)\mid s_1\}&=\sum_{i\in s_2}\sum_{j\in s_2}\Big(1-\frac{\pi_{2i}\pi_{2j}}{\pi_{2ij}}\Big)e_i(s_1)e_j(s_1)
\end{aligned}
$$

given $\pi_{2i}\equiv\frac{n_2}{N-n_1}$, and $\pi_{2ij}=\pi_{2i}$ if $i=j$ or $\pi_{2ij}=\frac{n_2(n_2-1)}{(N-n_1)(N-n_1-1)}$ if $i\neq j$. A design-unbiased estimator of the MSE of the corresponding $\hat{P}$ follows as

$$
\overline{\mbox{mse}}=E_q\big(\mbox{mse}(s_1)\mid s\big)-V_q\big(\hat{P}(s_1)\mid s\big).\tag{20}
$$
#### MSE of residual\-tuned SRB\-kNN
The error $B_c(s_1)$ is given earlier. Similarly to that of SRB-kNN, we have

$$
\mbox{mse}_c(s_1)=\hat{B}_c(s_1)^2-\hat{V}_s\{\hat{B}_c(s_1)\mid s_1\}+\hat{V}_s\{B_c(s_1)\mid s_1\}
$$

where

$$
\hat{V}_s\{\hat{B}_c(s_1)\mid s_1\}=n_1^2\Big(\frac{1}{n}-\frac{1}{N}\Big)^2v_2\qquad\text{and}\qquad\hat{V}_s\{B_c(s_1)\mid s_1\}=\Big(\frac{n_2}{n}\Big)^2v_2.
$$

Notice that, regarding the variance of $B_c(s_1)$ conditional on $s_1$, we can simplify the random terms in $B_c(s_1)$ as

$$
\begin{aligned}
\frac{N-n}{N}\bar{e}_{U\setminus s}(s_1)-\frac{N-n}{N}\frac{n_2}{n}\bar{e}_{s_2}(s_1)&=\frac{N-n_1}{N}\bar{e}_{U\setminus s_1}(s_1)-\frac{n_2}{N}\bar{e}_{s_2}(s_1)-\frac{N-n}{n}\frac{n_2}{N}\bar{e}_{s_2}(s_1)\\
&=\frac{N-n_1}{N}\bar{e}_{U\setminus s_1}(s_1)-\frac{n_2}{n}\bar{e}_{s_2}(s_1)
\end{aligned}
$$

where $\bar{e}_{U\setminus s_1}(s_1)$ is a constant given $s_1$. A design-unbiased estimator of the MSE of the corresponding $\tilde{P}$ follows as

$$
\overline{\mbox{mse}}_c=E_q\big(\mbox{mse}_c(s_1)\mid s\big)-V_q\big(\hat{P}(s_1)\mid s\big).\tag{21}
$$
#### MSE of OOB\-tuned SRB\-kNN
The error due to the tuned $\eta(x,s_1)$ is given as

$$
B_\tau(s_1)=\frac{1}{N}\sum_{i\notin s}e_i(s_1)-\frac{N-n}{N}\frac{1}{n_2}\sum_{i\in s_2}e_i(s_1)=\frac{N-n}{N}\big(\bar{e}_{U\setminus s}(s_1)-\bar{e}_{s_2}(s_1)\big)
$$

which has expectation 0 conditional on $s_1$, since tuning has made it unbiased. It follows that we can use $\hat{B}_\tau(s_1)\equiv 0$, such that $V_s\{\hat{B}_\tau(s_1)\mid s_1\}\equiv 0$ as well. Regarding the variance of $B_\tau(s_1)$ conditional on $s_1$, we have now

$$
\begin{aligned}
\bar{e}_{U\setminus s}(s_1)-\bar{e}_{s_2}(s_1)&=\frac{N-n_1}{N-n}\bar{e}_{U\setminus s_1}(s_1)-\frac{n_2}{N-n}\bar{e}_{s_2}(s_1)-\bar{e}_{s_2}(s_1)\\
&=\frac{N-n_1}{N-n}\bar{e}_{U\setminus s_1}(s_1)-\frac{N-n_1}{N-n}\bar{e}_{s_2}(s_1)
\end{aligned}
$$

where $\bar{e}_{U\setminus s_1}(s_1)$ is a constant given $s_1$, such that

$$
\mbox{mse}_\tau(s_1)=\hat{V}_s\{B_\tau(s_1)\mid s_1\}=\Big(1-\frac{n_1}{N}\Big)^2v_2.
$$

A design-unbiased estimator of the MSE of the corresponding $\hat{P}$ follows as

$$
\overline{\mbox{mse}}_\tau=E_q\big(\mbox{mse}_\tau(s_1)\mid s\big)-V_q\big(\hat{P}(s_1)\mid s\big).\tag{22}
$$
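The conditional claim behind $\mbox{mse}_\tau(s_1)$, namely that $(1-n_1/N)^2v_2$ is unbiased for the variance of $B_\tau(s_1)$ given $s_1$, can be checked by brute force on a toy population. The sketch below assumes $s_2$ is a simple random sample of size $n_2\geq 2$ from the $N-n_1$ units outside $s_1$; the residuals and helper names are hypothetical, and exact fractions are used so that equality is not blurred by rounding.

```python
from fractions import Fraction as F
from itertools import combinations

def v2(e_s2, M):
    """SRS variance estimator of the mean of e over s2; M = N - n1."""
    n2 = len(e_s2)
    mean = F(sum(e_s2), n2)
    ss = sum((e - mean) ** 2 for e in e_s2)
    return (F(1, n2) - F(1, M)) * ss / (n2 - 1)

def mse_tau(e_s2, N, n1):
    """mse_tau(s1) = (1 - n1/N)^2 * v2."""
    return (1 - F(n1, N)) ** 2 * v2(e_s2, N - n1)

def compare(e_rest, n2, N, n1):
    """Return (E_s[mse_tau | s1], E_s[B_tau | s1], Var_s[B_tau | s1])."""
    M, n = N - n1, n1 + n2
    assert len(e_rest) == M
    est, err = [], []
    for s2 in combinations(range(M), n2):
        e_s2 = [e_rest[i] for i in s2]
        e_out = [e_rest[i] for i in range(M) if i not in s2]
        # B_tau = (N - n)/N * (mean of e outside s - mean of e over s2)
        err.append(F(N - n, N) * (F(sum(e_out), N - n) - F(sum(e_s2), n2)))
        est.append(mse_tau(e_s2, N, n1))
    k = len(err)
    mean_err = sum(err) / k
    var_err = sum((b - mean_err) ** 2 for b in err) / k
    return sum(est) / k, mean_err, var_err

print(compare([1, 0, -1, 1, 0], n2=2, N=8, n1=3))
```

The first and third entries of the printed tuple coincide and the second is zero, matching the statements that $B_\tau(s_1)$ has conditional expectation 0 and that its conditional MSE is estimated without bias.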
## Appendix D Monte Carlo SRB
Exact calculation of $E_q(\cdot)$ or $V_q(\cdot)$ is infeasible if $\mathcal{C}(n,n_1)$ is too large. In such situations, the $q$-expectation and $q$-variance need to be evaluated by Monte Carlo based on $T$ (training, test) sets, denoted by $(s_1^{(t)},s_2^{(t)})$ for $t=1,\ldots,T$.
For bias estimation by ([18](https://arxiv.org/html/2606.28795#A3.E18)), one would calculate the $q$-expectation as

$$
E_q\Big(\frac{1}{n_2}\sum_{i\in s_2}e_i(s_1)\Big)\approx\frac{1}{T}\sum_{t=1}^{T}\frac{1}{n_2}\sum_{i\in s_2^{(t)}}e_i(s_1^{(t)})=\frac{1}{T}\sum_{t=1}^{T}\frac{1}{n_2}\sum_{i\in s_2^{(t)}}\big(\eta(x_i,s_1^{(t)})-y_i\big).
$$

Similarly for other $q$-expectations. For the $q$-variance in ([20](https://arxiv.org/html/2606.28795#A3.E20)), one would use the sample variance over the $T$ training sets,

$$
V_q\big(\hat{P}(s_1)\mid s\big)\approx\frac{1}{T-1}\sum_{t=1}^{T}\Big(\hat{P}(s_1^{(t)})-\frac{1}{T}\sum_{t'=1}^{T}\hat{P}(s_1^{(t')})\Big)^2.
$$

Similarly for the $q$-variance in ([21](https://arxiv.org/html/2606.28795#A3.E21)) or ([22](https://arxiv.org/html/2606.28795#A3.E22)).
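The two Monte Carlo approximations above can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it uses a 1-nearest-neighbour rule on a single scalar feature as a stand-in for $\eta$, draws each split of the sample as a simple random subsample, and works on made-up data. All function names are hypothetical. The helper `q_variance` applies to any list of per-split values, such as $\hat{P}(s_1^{(t)})$.

```python
import random

def one_nn_predict(x_train, y_train, x_new):
    """1-nearest-neighbour prediction on a scalar feature (stand-in for eta)."""
    j = min(range(len(x_train)), key=lambda k: abs(x_train[k] - x_new))
    return y_train[j]

def oob_residual_means(x, y, n1, T, rng):
    """For T random (training, test) splits of the sample, return the out-of-bag
    mean residual (1/n2) * sum_{i in s2} (eta(x_i, s1) - y_i) of each split."""
    n = len(y)
    out = []
    for _ in range(T):
        perm = rng.sample(range(n), n)       # random permutation of the sample
        s1, s2 = perm[:n1], perm[n1:]
        xs, ys = [x[i] for i in s1], [y[i] for i in s1]
        resid = [one_nn_predict(xs, ys, x[i]) - y[i] for i in s2]
        out.append(sum(resid) / len(resid))
    return out

def q_variance(values):
    """Sample variance over the T splits, with divisor T - 1."""
    T = len(values)
    mean = sum(values) / T
    return sum((v - mean) ** 2 for v in values) / (T - 1)

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(30)]                  # hypothetical features
y = [1.0 if xi + rng.gauss(0.0, 1.0) > 0 else 0.0 for xi in x]  # hypothetical outcomes
oob = oob_residual_means(x, y, n1=20, T=200, rng=rng)
q_expectation = sum(oob) / len(oob)   # Monte Carlo approximation of the q-expectation
print(q_expectation, q_variance(oob))
```

Increasing `T` reduces the Monte Carlo error of both quantities at a proportional cost in refits of the predictor.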
Finally, as explained in Zhang et al. (2025, Section 2.3.2), one needs to estimate the MSE of the Monte Carlo SRB-predictor being implemented for out-of-sample prediction (instead of the exact SRB-predictor). Specifically, in terms of ([20](https://arxiv.org/html/2606.28795#A3.E20)) of SRB-kNN, one would use the unbiased estimator

$$
\tilde{\overline{\mbox{mse}}}=\frac{1}{T}\sum_{t=1}^{T}\Big(\mbox{mse}(s_1^{(t)})-\Big\{\hat{P}(s_1^{(t)})-\frac{1}{T}\sum_{t'=1}^{T}\hat{P}(s_1^{(t')})\Big\}^2\Big).
$$

Similarly for ([21](https://arxiv.org/html/2606.28795#A3.E21)) of residual-tuned SRB-kNN, and ([22](https://arxiv.org/html/2606.28795#A3.E22)) of OOB-tuned SRB-kNN.
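Once the per-split quantities are available, the estimator above is a simple average. The sketch below assumes that the $T$ values $\mbox{mse}(s_1^{(t)})$ and $\hat{P}(s_1^{(t)})$ have already been computed elsewhere and are passed in as plain lists; the function name `mc_mse` and the numbers in the usage lines are hypothetical.

```python
def mc_mse(mse_vals, p_vals):
    """Average over t of mse(s1^(t)) minus the squared deviation of
    P_hat(s1^(t)) from its Monte Carlo mean."""
    T = len(p_vals)
    if len(mse_vals) != T:
        raise ValueError("mse_vals and p_vals must have the same length")
    p_bar = sum(p_vals) / T
    return sum(m - (p - p_bar) ** 2 for m, p in zip(mse_vals, p_vals)) / T

# Hypothetical per-split values for T = 3 splits.
mse_vals = [0.5, 0.7, 0.6]
p_vals = [1.0, 1.2, 1.1]
print(mc_mse(mse_vals, p_vals))
```

Note that the subtracted term equals $(T-1)/T$ times the sample variance of the $\hat{P}(s_1^{(t)})$ values, which is how this form differs from plugging the divisor-$(T-1)$ variance into ([20](https://arxiv.org/html/2606.28795#A3.E20)) directly.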
## References
- [1] Angelopoulos, A., Bates, S., Fannjiang, A., Jordan, M. I. and Zrnic, T. (2023). Prediction-powered inference. *Science*, 382:669–674. [DOI:10.1126/science.adi6000](https://doi.org/10.1126/science.adi6000)
- [2] Beaumont, J.-F. and Haziza, D. (2022). Statistical inference from finite population samples: A critical review of frequentist and Bayesian approaches. *The Canadian Journal of Statistics*, 50:1186–1212.
- [3] Blackwell, D. (1947). Conditional expectation and unbiased sequential estimation. *Annals of Mathematical Statistics*, 18:105–110.
- [4] Breidt, F. J. and Opsomer, J. D. (2017). Model-assisted survey estimation with modern prediction techniques. *Statistical Science*, 32:190–205.
- [5] Breiman, L. (2001a). Statistical modelling: The two cultures. *Statistical Science*, 16:199–231.
- [6] Breiman, L. (2001b). Random Forests. *Machine Learning*, 45:5–32.
- [7] Dagdoug, M., Goga, C. and Haziza, D. (2021). Model-Assisted Estimation Through Random Forests in Finite Population Sampling. *Journal of the American Statistical Association*, 118:1234–1251.
- [8] European Commission (2017). *European Statistics Code of Practice*. [https://ec.europa.eu/eurostat/web/quality/european-statistics-code-of-practice](https://ec.europa.eu/eurostat/web/quality/european-statistics-code-of-practice)
- [9] Fisher, R. A. (1956). *Statistical Methods and Scientific Inference*. Oliver and Boyd, Edinburgh and London.
- [10] Hansen, M. (1987). Some History and Reminiscences on Survey Sampling. *Statistical Science*, 2:180–190.
- [11] Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. *Journal of the American Statistical Association*, 47:663–685.
- [12] Kalton, G. (2002). Models in practice of survey sampling. *Journal of Official Statistics*, 18:129–154.
- [13] McConville, K. S. and Toth, D. (2019). Automated selection of post-strata using a model-assisted regression tree estimator. *Scandinavian Journal of Statistics*, 46:389–413.
- [14] Neyman, J. (1934). On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection. *Journal of the Royal Statistical Society*, pp. 558–625.
- [15] Rao, C. R. (1945). Information and accuracy attainable in the estimation of statistical parameters. *Bulletin of Calcutta Mathematical Society*, 37:81–91.
- [16] Rao, J. N. K. (2005). Interplay between sample survey theory and practice: An appraisal. *Survey Methodology*, 31:117–138.
- [17] Rao, J. N. K. (2011). Impact of frequentist and Bayesian methods on survey sampling practice: A selective appraisal. *Statistical Science*, 26:240–256.
- [18] Royall, R. M. (1970). On finite population sampling theory under certain linear regression models. *Biometrika*, 57:377–387.
- [19] Sanguiao-Sande, L. and Zhang, L.-C. (2021). Design-Unbiased Statistical Learning in Survey Sampling. *Sankhya A*, Centenary Issue in Honour of C. R. Rao, 83:714–744.
- [20] Statistics Canada (2017). *Quality Assurance Framework*, 3rd edition. [https://www150.statcan.gc.ca/n1/pub/12-539-x/12-539-x2019001-eng.htm](https://www150.statcan.gc.ca/n1/pub/12-539-x/12-539-x2019001-eng.htm)
- [21] Särndal, C.-E., Swensson, B. and Wretman, J. (1992). *Model-Assisted Survey Sampling*. Springer, New York.
- [22] Smith, T. M. F. (1994). Sample surveys 1975–1990; an age of reconciliation? (with discussion). *International Statistical Review*, 62:5–34.
- [23] Stone, C. J. (1977). Consistent Nonparametric Regression. *The Annals of Statistics*, 5:595–620.
- [24] United Nations (2019). *National Quality Assurance Frameworks Manual for Official Statistics*. [https://unstats.un.org/unsd/methodology/dataquality/](https://unstats.un.org/unsd/methodology/dataquality/)
- [25] Valliant, R., Dorfman, R. M. and Royall, R. M. (2000). *Finite Population Sampling and Inference: A Prediction Approach*. Wiley, New York.
- [26] Zhang, L.-C., Sanguiao-Sande, L. and Lee, D. (2025). Design-based predictive inference. *Journal of Official Statistics*, 41:404–432.