TILT: Target-induced loss tilting under covariate shift

arXiv cs.LG Papers

Summary

TILT introduces a novel objective for unsupervised domain adaptation under covariate shift that penalizes an auxiliary component on unlabeled target data, implicitly achieving self-localized importance weighting with bounded estimands. Theoretical guarantees and experiments on shifted CIFAR-100 show improved target performance over baselines.

arXiv:2605.14280v1 Announce Type: new Abstract: We introduce and analyze Target-Induced Loss Tilting (TILT) for unsupervised domain adaptation under covariate shift. It is based on a novel objective function that decomposes the source predictor as $f+b$ and fits $f+b$ on labeled source data while simultaneously penalizing the auxiliary component $b$ on unlabeled target inputs. The resulting fit $f$ is deployed as the final target predictor. At the population level, we show that this target-side penalty implicitly induces relative importance weighting, but in terms of an estimand $b^*_f$ that is self-localized to the current error and remains uniformly bounded for any source-target pair (even those with disjoint supports). We prove a general finite-sample oracle inequality on the excess risk, and use it to give an end-to-end guarantee for training with sparse ReLU networks. Experiments on controlled regression problems and shifted CIFAR-100 distillation show that TILT improves target-domain performance over source-only training, exact importance weighting, and relative density-ratio baselines, with a stable dependence on the regularization parameter.


# 1 Introduction
Source: [https://arxiv.org/html/2605.14280](https://arxiv.org/html/2605.14280)
Authors: Kakei Yamamoto (Lab for Information and Decision Systems, Statistics and Data Science Center, and EECS, Massachusetts Institute of Technology; kakei@mit.edu) and Martin J. Wainwright (Lab for Information and Decision Systems, Statistics and Data Science Center, and Mathematics and EECS, Massachusetts Institute of Technology; mjwain@mit.edu)


Many prediction systems are trained on labeled data from a source domain and then deployed on inputs drawn from a different target domain. This mismatch arises in applications such as medical diagnosis across patient populations, animal classification across acquisition environments, and poverty prediction across geographic settings (e.g., see the survey paper (koh2021wilds) and references therein). We study this problem under covariate shift: the source and target covariate distributions differ, but the conditional law of the response given the covariate is shared (shimodaira2000improving; sugiyama2012density). In this setting, ordinary source empirical risk minimization need not target the relevant prediction criterion, since it weights errors according to the source marginal rather than the target marginal.

The standard correction is importance weighting. If $p$ and $q$ denote the source and target covariate densities, respectively, then the target risk can be expressed as a source expectation weighted by the density ratio $q(x)/p(x)$. This identity is exact, but its direct use has two drawbacks. First, the ratio can be highly variable or unbounded, which increases statistical variance and can destabilize optimization. Second, the ratio is typically unknown and must be estimated from unlabeled source and target samples. Direct density-ratio methods such as KMM, KLIEP, and LSIF avoid separately estimating $p$ and $q$, but still require an explicit weight-estimation stage (huang2007correcting; gretton2008covariate; sugiyama2008direct; kanamori2009least). Relative density-ratio methods such as RuLSIF replace the ordinary ratio by a regularized ratio to improve stability (yamada2011relative), while one-step approaches jointly optimize prediction and weighting components (zhang2020onestep). These methods differ algorithmically, but they remain explicitly weight-based. See [Section 1.1](https://arxiv.org/html/2605.14280#S1.SS1) for a more detailed discussion of this and other related work.
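The importance-weighting identity and its variance problem can be illustrated numerically. The following is a minimal sketch, not taken from the paper: the Beta densities, the test function `g`, and the sample sizes are all hypothetical choices made for illustration. It checks that a source average weighted by $q(x)/p(x)$ matches the target average, and records how large individual weights can become.

```python
import math
import numpy as np


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution evaluated at x in (0, 1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


def g(x):
    """Hypothetical per-example loss whose target mean we want."""
    return (x - 0.5) ** 2


rng = np.random.default_rng(0)

# Assumption for illustration: source p = Beta(3, 2), target q = Beta(2, 3).
x_src = rng.beta(3.0, 2.0, size=200_000)
x_tgt = rng.beta(2.0, 3.0, size=200_000)

# Exact density ratio q(x) / p(x), evaluated at the source samples.
w = beta_pdf(x_src, 2.0, 3.0) / beta_pdf(x_src, 3.0, 2.0)

target_mean = g(x_tgt).mean()                 # direct target average
weighted_source_mean = (w * g(x_src)).mean()  # importance-weighted source average
max_weight = w.max()                          # the ratio is unbounded near x = 0

print(target_mean, weighted_source_mean, max_weight)
```

For this particular pair the ratio equals $(1-x)/x$, which is unbounded near $x=0$ even though both densities are smooth, so a few source points can dominate the weighted average.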

In this paper, we propose Target-Induced Loss Tilting (TILT), a novel one-step method for covariate-shift adaptation that removes density-ratio estimation from the algorithm. The TILT method decomposes the source predictor as $f+b$. The sum $f+b$ is fit on labeled source data, while the auxiliary component $b$ is penalized on unlabeled target covariates; the deployed predictor is $f$. The role of $b$ is to absorb structure useful for fitting the source distribution, subject to the constraint that this auxiliary structure is suppressed on the target covariate distribution. Thus the target sample enters only through a penalty on the auxiliary component, not through estimated density ratios or target labels.

A key technical property of the TILT method is that this target-side penalty induces a relative weighting criterion after profiling out the auxiliary component. For the least-squares population objective, optimizing over $b$ yields an exact $\lambda$-relative weighted target excess risk. For a given predictor $f$, the corresponding optimal offset is given by $b^*_f(x) = -\frac{p(x)}{p(x)+\lambda q(x)}\{f(x)-f^*(x)\}$, where $f^*(x) = \mathbb{E}[Y \mid X=x]$ is the optimal predictor. This expression is the *product* of two terms: the bounded factor $p(x)/(p(x)+\lambda q(x))$, which takes the place of the possibly unbounded density ratio $q(x)/p(x)$, and the current prediction error $f-f^*$. Consequently, the TILT method has a *self-localizing property*: the auxiliary task becomes easier as the main predictor $f$ improves, and the resulting procedure implements $f^*$-targeted covariate-shift correction without explicitly estimating $q/p$.

Contributions: Let us summarize our contributions. First, we introduce TILT, a one-step procedure for covariate-shift adaptation that fits an additive predictor $f+b$ on labeled source data, penalizes the auxiliary component $b$ on unlabeled target covariates, and deploys only $f$. Unlike importance-weighting methods, TILT never estimates, clips, or outputs a density ratio.

Second, we provide a rigorous theoretical underpinning for the method. Our first theoretical result ([Section 2.1](https://arxiv.org/html/2605.14280#S2.SS1)) shows that this simple target-side penalty has attractive population-level properties. Profiling out the auxiliary component $b$ yields a $\lambda$-relative weighted target excess risk. Moreover, the optimal offset $b^*_f$ is uniformly bounded for arbitrary source-target pairs and is localized to the current prediction error. Thus, TILT implements covariate-shift correction without requiring density-ratio estimation, and in a way that is self-localized to the error $f-f^*$.

In addition, we give two finite-sample guarantees for the data-based procedure used in practice: [Theorem 2](https://arxiv.org/html/2605.14280#Thmtheorem2) gives a general oracle inequality that separates approximation error from the source and target estimation terms, and [Section 3](https://arxiv.org/html/2605.14280#S3) specializes this general result to sparse ReLU networks to obtain an end-to-end nonparametric rate. Our bounds make explicit how the regularization parameter $\lambda$ trades target relevance against estimation variance, and how various smoothness conditions affect the rates.

On the empirical side, we evaluate TILT through a suite of numerical studies. In controlled regression experiments with known source and target densities, we compare against source ERM, as well as the *oracle forms* of importance weighting and relative density-ratio weighting, which are given knowledge of the pair $(p,q)$. We construct an experiment demonstrating that TILT achieves *minimax-optimal* rates in terms of covariate shift severity. Finally, in shifted CIFAR-100 distillation, we extend the same idea to auxiliary logits and show that target-side tilting improves target performance over source-only training and KD methods under image-level covariate shift.

Organization: [Section 2](https://arxiv.org/html/2605.14280#S2) sets up the covariate-shift problem and introduces the TILT objective. [Section 2.1](https://arxiv.org/html/2605.14280#S2.SS1) proves the population reweighting identity, and [Section 2.2](https://arxiv.org/html/2605.14280#S2.SS2) records the classification variant used in the distillation experiments. [Section 3](https://arxiv.org/html/2605.14280#S3) gives the main finite-sample and ReLU-network guarantees. [Section 4](https://arxiv.org/html/2605.14280#S4) evaluates TILT on synthetic regression problems and shifted CIFAR-100.

### 1.1 Related work

Here we elaborate on three threads of work related to the TILT approach; due to space constraints, we limit ourselves to those papers most directly relevant.

Importance weighting: The standard correction for covariate shift is importance-weighted (IW) risk minimization (shimodaira2000improving; sugiyama2012density), which rewrites target risk as a source expectation weighted by $w(x) = q(x)/p(x)$. Generalization bounds for IW methods show that difficulty depends on the magnitude and variability of the weight function (cortes2010learning; MaPatWai23). Since the density ratio is typically unknown, a large literature estimates it directly from unlabeled source and target samples, including KMM (huang2007correcting; gretton2008covariate), KLIEP (sugiyama2008direct), LSIF (kanamori2009least), and related convex risk formulations (nguyen2010estimating). Zhang et al. (zhang2020onestep) replace the two-stage importance-weighting pipeline with a joint optimization over predictors and weights, but still explicitly parameterize and learn the density ratio. Relative density-ratio methods such as RuLSIF replace the ordinary ratio by a regularized variant so as to improve stability at the expense of bias (yamada2011relative). The TILT method involves a related form of regularization, but *does not* estimate or output density ratios: the relative weights appear only after the auxiliary component is optimized out, and the optimal offset $b^*_f$ couples them with the estimate $f$ (cf. [Section 2.1](https://arxiv.org/html/2605.14280#S2.SS1)).

Domain adaptation and distributional robustness: A broader line of domain-adaptation work seeks representations or predictors whose behavior is stable across domains. Classical theory bounds target error using source error and a discrepancy between domains (ben2010theory). Kernel and moment-matching approaches control distributional discrepancy through quantities such as MMD (gretton2012kernel), while deep adaptation methods align features or domains using adversarial losses, correlation matching, or residual transfer (e.g., ganin2016domain; long2015learning; sun2016deep; long2016unsupervised). Invariant and robust prediction methods, including invariant risk minimization, risk extrapolation, and anchor regression, instead penalize predictors whose performance or residuals vary across observed environments or anchor variables (arjovsky2019invariant; krueger2021out; rothenhausler2021anchor). All of these methods are based on a different observation model than ours: they rely on observed environments, anchors, or multiple domains that reveal how the distribution changes. In contrast, the TILT procedure requires only labeled source data and unlabeled target covariates, with the target covariates serving as the adaptation signal.

Additive decompositions and residual transfer: Additive shared-specific decompositions have been used in domain adaptation, multi-task learning, and representation learning. Feature augmentation methods introduce shared and domain-specific copies of features (daume2007frustratingly); multi-task models decompose parameters into shared and task-specific components (evgeniou2004regularized; jalali2010dirty); and domain separation networks split representations into shared and private factors (bousmalis2016domain). Residual transfer methods adapt a source predictor by learning an additional residual component for the target domain (kuzborskij2013stability; long2016unsupervised), or by using residuals to reduce bias in student-teacher estimation (yamamoto2026residual). The TILT method also exploits an additive decomposition, but with the opposite deployment logic: the auxiliary component is allowed to fit source-specific residual structure while being penalized on target inputs, and is discarded when constructing the final predictor. By design, this choice yields an implicit covariate-shift correction rather than a shared-private architectural bias.

## 2 Problem and method formulation

We consider the unsupervised domain adaptation problem, with $\mathcal{X} \subset \mathbb{R}^d$ being the input space and $\mathcal{Y} \subset \mathbb{R}$ the output space. We are given a source dataset $\{(x_i, y_i)\}_{i=1}^n$ consisting of $n$ labeled examples drawn i.i.d. from a source distribution $\mathbb{P}$ over $\mathcal{X} \times \mathcal{Y}$. Additionally, we have access to a target dataset $\{\tilde{x}_j\}_{j=1}^m$ consisting of $m$ unlabeled examples drawn i.i.d. from the marginal distribution $\mathbb{Q}_X$ of a target distribution $\mathbb{Q}$. In the covariate shift setting, the marginal distributions of the inputs differ ($\mathbb{P}_X \neq \mathbb{Q}_X$), but the conditional distribution of the labels remains invariant: $\mathbb{P}(Y \mid X) = \mathbb{Q}(Y \mid X)$.

### 2.1 TILT: Target-induced loss tilting

We begin by describing the TILT procedure for real-valued outputs $y \in \mathbb{R}$, using the least-squares loss. In this setting, the optimal predictor is given by the regression function $f^*(x) = \mathbb{E}[Y \mid X=x]$, and the quality of an estimate $f$ is measured by the $\mathbb{Q}$-target excess risk

$$\mathcal{E}_{\mathbb{Q}}^2(f) = \mathbb{E}_{\mathbb{Q}}\big(Y - f(\widetilde{X})\big)^2 - \mathbb{E}_{\mathbb{Q}}\big(Y - f^*(\widetilde{X})\big)^2 \;\equiv\; \mathbb{E}_{\mathbb{Q}}\big(f(\widetilde{X}) - f^*(\widetilde{X})\big)^2. \tag{1}$$

We introduce an additive decomposition of the form $f(x) + b(x)$, where $f \in \mathscr{F}$ is the predictor and $b \in \mathscr{B}$ is an auxiliary function. We propose to learn $f$ and $b$ simultaneously by minimizing the following joint empirical objective:

$$\text{TILT objective:}\qquad \widehat{\mathcal{L}}(f,b) := \frac{1}{n}\sum_{i=1}^{n}\big(f(x_i) - y_i + b(x_i)\big)^2 + \frac{\lambda}{m}\sum_{j=1}^{m} b^2(\widetilde{x}_j). \tag{2}$$

We jointly minimize $\widehat{\mathcal{L}}$ over $(f,b)$ and use the $f$-component of a minimizer as the final TILT predictor. Here the regularization parameter $\lambda > 0$ controls the strength of bias suppression on the target domain. Note that the objective ([2](https://arxiv.org/html/2605.14280#S2.E2)) is simple, and remains jointly convex in the function values $f(x_i)$, $b(x_i)$, and $b(\widetilde{x}_j)$.
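To make the objective concrete, here is a minimal sketch under a simplifying assumption that is not part of the paper's general setup: both $f$ and $b$ are linear in fixed feature maps, $f = \Phi\theta$ and $b = \Psi\alpha$. In that case the objective (2) is an ordinary least-squares problem in the stacked coefficients $(\theta, \alpha)$ and can be solved in one call. The function names `tilt_objective` and `fit_tilt_linear` are hypothetical.

```python
import numpy as np


def tilt_objective(f_src, b_src, y_src, b_tgt, lam):
    """Empirical TILT objective (2), evaluated from function values."""
    source_fit = np.mean((f_src + b_src - y_src) ** 2)   # (1/n) sum (f + b - y)^2
    target_penalty = lam * np.mean(b_tgt ** 2)           # (lam/m) sum b^2 on target
    return source_fit + target_penalty


def fit_tilt_linear(Phi_src, Psi_src, y_src, Psi_tgt, lam):
    """Jointly minimize (2) for f = Phi @ theta and b = Psi @ alpha.

    Assumption: linear-in-parameters models, so the problem reduces to a
    single stacked least-squares system. Returns (theta, alpha).
    """
    n, k = Phi_src.shape
    m = Psi_tgt.shape[0]
    # Source rows: (Phi theta + Psi alpha - y) / sqrt(n).
    top = np.hstack([Phi_src, Psi_src]) / np.sqrt(n)
    # Target rows: sqrt(lam / m) * Psi alpha, with no dependence on theta.
    bottom = np.hstack([np.zeros((m, k)), Psi_tgt]) * np.sqrt(lam / m)
    A = np.vstack([top, bottom])
    rhs = np.concatenate([y_src / np.sqrt(n), np.zeros(m)])
    coef, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return coef[:k], coef[k:]
```

Only `theta` would be kept for deployment; `alpha` is discarded after training. For neural-network classes, the same objective would instead be minimized by a gradient-based optimizer.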

Theoretical underpinning: excess risk and optimal offset: Our first result shows that the TILT objective ([2](https://arxiv.org/html/2605.14280#S2.E2)) has a rigorous justification at the population level, that is, in the limit of infinite sample size. Define the population-level TILT objective $\bar{\mathcal{L}}(f,b) := \mathbb{E}[\widehat{\mathcal{L}}(f,b)]$, where the expectation is taken over both source and target data, and note that $\bar{\mathcal{L}}$ is minimized by the pair $(f^*, 0)$, where $f^*(x) = \mathbb{E}[Y \mid X=x]$ is the optimal predictor. The *auxiliary excess TILT-risk* is given by

$$\mathcal{A}(f,b) := \bar{\mathcal{L}}(f,b) - \bar{\mathcal{L}}(f^*,0), \tag{3}$$

which compares the loss of a pair $(f,b)$ relative to the oracle optimum $(f^*,0)$. The following result shows that minimizing the auxiliary excess risk over the choice of $b$ yields, after rescaling by $1/\lambda$, the $\lambda$-weighted error function

$$\mathcal{E}_{\lambda}^2(f) := \mathbb{E}_{\mathbb{Q}}\Big[\frac{p(X)}{p(X)+\lambda q(X)}\big(f(X) - f^*(X)\big)^2\Big] \tag{4}$$

and that moreover, the optimal offset function $b^*_f$ has an especially attractive and self-localized form.

###### Proposition 1 (Excess risk and optimal offset).

For any scalar $\lambda > 0$, function $f$ and density pair $(p,q)$, we have the equivalence

$$\inf_{b}\tfrac{1}{\lambda}\mathcal{A}(f,b) = \mathcal{E}_{\lambda}^2(f) \tag{5a}$$

with the infimum achieved by the optimal offset function

$$b^*_f(x) := -v_{\lambda}(x)\big(f(x) - f^*(x)\big) \qquad \text{where } v_{\lambda}(x) := \frac{p(x)}{p(x)+\lambda q(x)}. \tag{5b}$$

See [Appendix A](https://arxiv.org/html/2605.14280#A1) for the proof.
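The identity can be checked by a short pointwise calculation. The following is a sketch only, and not necessarily the argument used in the appendix; it assumes that $p$ and $q$ are densities with respect to a common measure and that $b$ ranges over all measurable functions. Write $\Delta = f - f^*$.

```latex
\begin{align*}
\mathcal{A}(f,b)
  &= \mathbb{E}_{\mathbb{P}}\big[(\Delta(X) + b(X))^2\big]
     + \lambda\,\mathbb{E}_{\mathbb{Q}}\big[b(X)^2\big]
  && \text{(cross term vanishes since } \mathbb{E}[Y - f^*(X)\mid X] = 0\text{)} \\
  &= \int \Big\{ p(x)\,\big(\Delta(x) + b(x)\big)^2 + \lambda\, q(x)\, b(x)^2 \Big\}\,dx . \\
\intertext{Minimizing the integrand separately at each $x$ gives}
0 &= 2p(x)\big(\Delta(x) + b(x)\big) + 2\lambda q(x)\, b(x)
  \;\;\Longrightarrow\;\;
  b^*_f(x) = -\frac{p(x)}{p(x) + \lambda q(x)}\,\Delta(x) = -v_\lambda(x)\,\Delta(x). \\
\intertext{Substituting back, with $1 - v_\lambda = \lambda q/(p + \lambda q)$,}
p\,(1 - v_\lambda)^2\Delta^2 + \lambda q\, v_\lambda^2 \Delta^2
  &= \frac{\lambda\, p\, q\,(\lambda q + p)}{(p + \lambda q)^2}\,\Delta^2
   = \lambda\, q\, v_\lambda\, \Delta^2 , \\
\inf_b \tfrac{1}{\lambda}\mathcal{A}(f,b)
  &= \int q(x)\, v_\lambda(x)\, \Delta(x)^2\,dx
   = \mathcal{E}_\lambda^2(f).
\end{align*}
```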

Proposition 1 (see [Section 2.1](https://arxiv.org/html/2605.14280#S2.SS1)) highlights several key advantages of TILT over classical density-ratio-based methods.

Correctness of profiled excess risk: Minimizing the auxiliary excess risk $\mathcal{A}(f,b)$ over the offset term $b$ defines a profiled excess risk. Equation ([5a](https://arxiv.org/html/2605.14280#S2.E5.1)) shows that this profiled risk is equivalent to $\mathcal{E}_{\lambda}^2(f)$, showing that TILT directly targets the relative reweighted risk without explicit density-ratio estimation. Moreover, the risk $\mathcal{E}_{\lambda}^2(f)$ from equation ([4](https://arxiv.org/html/2605.14280#S2.E4)) converges to the $\mathbb{Q}$-target excess risk ([1](https://arxiv.org/html/2605.14280#S2.E1)) as $\lambda$ decreases.

![Refer to caption](https://arxiv.org/html/2605.14280v1/x1.png)![Refer to caption](https://arxiv.org/html/2605.14280v1/x2.png)

Figure 1: Left panel: The $\lambda$-smoothed density ratio $w_{\lambda}(x) = \frac{q(x)}{p(x)+\lambda q(x)}$ is very spiky, compared to the relatively flat and well-behaved TILT weight $v_{\lambda}(x) = \frac{p(x)}{p(x)+\lambda q(x)}$. Right panel: Plots of the optimal offset function $b^*_f$ for three different choices of $f$: linear, quadratic and degree-six polynomial approximations to $f^*$. The optimal offset is much smaller and smoother than $w_{\lambda}$; note the different scale of the $y$-axis.

Uniform boundedness: The optimal offset $b^*_f$ does *not* involve the density ratio, which can be very large or unbounded. Instead, it involves the weight $v_{\lambda}$ from equation ([5b](https://arxiv.org/html/2605.14280#S2.E5.2)), which takes values in $[0,1]$ for any choice of $(p,q,\lambda)$. Thus, unlike explicit density-ratio estimation, TILT never requires learning an unbounded or otherwise ill-behaved quantity.

Self-localizing property: Equally important is that the optimal offset $b^*_f$ involves the product of $v_{\lambda}(x)$ with the *residual prediction error* $\Delta(x) = f(x) - f^*(x)$. Consequently, the offset $b^*_f$ shrinks as the prediction error decreases, so that the auxiliary task becomes easier as $f$ improves. This property corresponds to a form of self-localization: the optimal offset $b^*_f$ is correctly localized around the current prediction error.
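The boundedness claim is easy to check numerically, including in the extreme case of disjoint supports. The sketch below is illustrative only: the two uniform densities, the value of `lam`, and the error profile `delta` are hypothetical choices, and the helper names are not from the paper.

```python
import numpy as np


def tilt_weight(p, q, lam):
    """v_lambda = p / (p + lam * q); set to 0 where both densities vanish."""
    denom = p + lam * q
    return np.divide(p, denom, out=np.zeros_like(denom), where=denom > 0)


def relative_ratio(p, q, lam):
    """w_lambda = q / (p + lam * q); set to 0 where both densities vanish."""
    denom = p + lam * q
    return np.divide(q, denom, out=np.zeros_like(denom), where=denom > 0)


x = np.linspace(0.0, 1.0, 1001)

# Hypothetical densities with disjoint supports:
# p is uniform on [0, 0.5), q is uniform on [0.5, 1].
p = np.where(x < 0.5, 2.0, 0.0)
q = np.where(x >= 0.5, 2.0, 0.0)
lam = 0.1

v = tilt_weight(p, q, lam)      # stays inside [0, 1]
w = relative_ratio(p, q, lam)   # reaches 1 / lam on the target support

delta = np.sin(2 * np.pi * x)   # hypothetical prediction error f - f*
b_star = -v * delta             # optimal offset from equation (5b)

print(v.min(), v.max(), w.max(), np.abs(b_star).max())
```

Here the raw ratio $q/p$ is infinite on the target support, the relative ratio $w_\lambda$ equals $1/\lambda$ there, and $v_\lambda$ is simply zero, so the optimal offset never exceeds the current error in magnitude.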

### 2.2 Classification extension of TILT

Thus far, we have described the TILT method for the least-squares setting. Here we describe its extension to classification, which underlies the CIFAR-100 distillation experiments described in [Section 4](https://arxiv.org/html/2605.14280#S4). At a high level, KD-TILT is simply the knowledge-distillation analogue of TILT: it keeps the same source-side prediction objective, while adding a target-side regularization that suppresses the auxiliary component on unlabeled target inputs.

For a $K$-class logistic regression problem, we write the logits as $f(x) + b(x)$, where $f, b: \mathcal{X} \to \mathbb{R}^K$, and let $\tau: \mathcal{X} \to \mathbb{R}^K$ be teacher logits. Given a temperature $T > 0$ and a mixing weight $\beta \in [0,1]$, we define

$$\inf_{f,b}\Biggl\{\frac{1}{n}\sum_{i=1}^{n}\Bigl[(1-\beta)\,\ell_{\mathrm{CE}}\Bigl(y_i,\,\pi\bigl(f(x_i)+b(x_i)\bigr)\Bigr) + \beta T^2 \underbrace{\mathrm{KL}\Bigl(\pi_T\bigl(\tau(x_i)\bigr)\,\Big\|\,\pi_T\bigl(f(x_i)+b(x_i)\bigr)\Bigr)}_{\text{KD loss}}\Bigr] + \frac{\lambda}{m}\sum_{j=1}^{m} T^2 \underbrace{\mathrm{KL}\Bigl(\pi_T\bigl(f(\widetilde{x}_j)+b(\widetilde{x}_j)\bigr)\,\Big\|\,\pi_T\bigl(f(\widetilde{x}_j)\bigr)\Bigr)}_{\text{TILT penalty}}\Biggr\}, \tag{6}$$

where $\pi$ is the softmax function, and $\pi_T(\cdot) := \pi(\cdot/T)$ is the softmax at temperature $T$. The first two terms are the standard supervised and soft distillation losses on the source domain, whereas the last term is the TILT component. In [Appendix D](https://arxiv.org/html/2605.14280#A4), we derive the KL-TILT procedure as a direct KL-divergence analogue of TILT, and show how the KD-TILT objective is obtained by mixing it with the ordinary supervised cross-entropy loss.
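The value of objective (6) can be computed directly from logit arrays. The following NumPy sketch evaluates the loss only; it is an illustrative reading of equation (6) rather than the authors' training code, and the function names and argument layout (rows are examples, columns are the $K$ classes, `y_src` holds integer labels) are assumptions.

```python
import numpy as np


def log_softmax(z, T=1.0):
    """Row-wise log-softmax of logits z at temperature T."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # stabilize the exponentials
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def kl_rows(log_p, log_q):
    """Row-wise KL(p || q) computed from log-probabilities."""
    return np.sum(np.exp(log_p) * (log_p - log_q), axis=1)


def kd_tilt_loss(f_src, b_src, y_src, tau_src, f_tgt, b_tgt, lam, beta, T):
    """Value of the KD-TILT objective (6) for given logits."""
    n = f_src.shape[0]
    student = f_src + b_src  # source-side logits f + b
    # Supervised cross-entropy at temperature 1.
    ce = -log_softmax(student)[np.arange(n), y_src]
    # Distillation term: KL(teacher || student) at temperature T.
    kd = kl_rows(log_softmax(tau_src, T), log_softmax(student, T))
    source_term = np.mean((1.0 - beta) * ce + beta * T ** 2 * kd)
    # TILT penalty: KL(pi_T(f + b) || pi_T(f)) on unlabeled target inputs.
    penalty = kl_rows(log_softmax(f_tgt + b_tgt, T), log_softmax(f_tgt, T))
    return source_term + lam * T ** 2 * np.mean(penalty)
```

The penalty vanishes whenever `b_tgt` does not change the target softmax, for example when it is identically zero, which mirrors the role of the squared penalty on $b$ in the regression objective.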

## 3 Learning-theoretic guarantees

Having laid out the population (infinite data) behavior of the TILT procedure, we now turn to its finite-sample analysis. We give two main results: [Theorem 2](https://arxiv.org/html/2605.14280#Thmtheorem2) is an oracle inequality on the excess risk that applies to a general pair $(\mathscr{F}, \mathscr{B})$ of function classes for the main and auxiliary fits, involving their metric entropies. Corollary 3 (see [Section 3](https://arxiv.org/html/2605.14280#S3)) provides an explicit rate by specializing this general result to the case of sparse deep neural nets.

The analysis in this section applies to the TILT procedure for the least-squares loss with real-valued response $Y \in \mathbb{R}$, and optimal predictor $f^*(x) := \mathbb{E}[Y \mid X=x]$. We say that a function class $\mathscr{F}$ is $B$-uniformly bounded if $\|f\|_{\infty} = \sup_{x \in \mathcal{X}}|f(x)| \leq B$ for all $f \in \mathscr{F}$. Throughout this section, we impose the following *standard regularity* (SR) conditions:

- the function classes $\mathscr{F}$ and $\mathscr{B}$ are each $B$-uniformly bounded.
- the response noise $Y - f^*(x)$ is sub-Gaussian with variance proxy $\sigma^2$.

Recalling the TILT objective function $\widehat{\mathcal{L}}$ from equation ([2](https://arxiv.org/html/2605.14280#S2.E2)), we let $(\widehat{f}, \widehat{b}) \in \arg\min_{(f,b) \in \mathscr{F} \times \mathscr{B}} \widehat{\mathcal{L}}(f,b)$ denote a joint minimizer, and the final TILT estimator corresponds to the function $\widehat{f}$.

Main excess risk bound: Our first result gives an oracle inequality for the expected reweighted excess risk. The approximation term measures how well $\mathscr{F}$ and $\mathscr{B}$ realize the population decomposition, while the estimation error is quantified in terms of metric entropy. More precisely, for any $\delta > 0$, let $\log\mathcal{N}_n := \log\mathcal{N}_n(\delta; \mathscr{F}+\mathscr{B}; L_{\infty})$ and $\log\mathcal{N}_m := \log\mathcal{N}_m(\delta; \mathscr{B}; L_{\infty})$ denote the metric entropy at scale $\delta$. The following theorem gives a bound in terms of the error metric $\mathcal{E}_{\lambda}^2(f)$ from equation ([5a](https://arxiv.org/html/2605.14280#S2.E5.1)).

###### Theorem 2 (Non-asymptotic bound on $\mathcal{E}_{\lambda}^2(\widehat{f})$).

Under the SR conditions, there exists a universal constant $c > 0$ such that for any $s \in (0,1)$ and any $\delta \in (0,1)$, the TILT estimate $\widehat{f}$ satisfies

$$\mathbb{E}\big[\mathcal{E}_{\lambda}^2(\widehat{f})\big] \leq (1+s)\underbrace{\inf_{f \in \mathscr{F},\, b \in \mathscr{B}}\tfrac{1}{\lambda}\mathcal{A}(f,b)}_{\text{Approximation error}} + \frac{1}{s}\underbrace{c\,(B^2+\sigma^2)\left(\frac{\log\mathcal{N}_n}{\lambda n} + \frac{\log\mathcal{N}_m}{m}\right)}_{\text{Estimation error}} + \operatorname{Rem}(\delta), \tag{7}$$

where $\operatorname{Rem}(\delta) = \frac{c'}{\lambda}(B+\sigma)\delta = \mathcal{O}(\delta)$ is a remainder term.

See [Appendix B](https://arxiv.org/html/2605.14280#A2) for the proof. A few comments on this result are in order:

- The first term in the bound ([7](https://arxiv.org/html/2605.14280#S3.E7)) represents approximation error, as measured by the best performing pair $(f,b)$ for the auxiliary excess risk $\mathcal{A}$ from equation ([3](https://arxiv.org/html/2605.14280#S2.E3)). From Proposition 1 ([Section 2.1](https://arxiv.org/html/2605.14280#S2.SS1)), if the auxiliary class $\mathscr{B}$ is rich enough to include $b^*_f$ (cf. equation ([5b](https://arxiv.org/html/2605.14280#S2.E5.2))) for each $f \in \mathscr{F}$, then this approximation error takes the form $\inf_{f \in \mathscr{F}}\mathcal{E}_{\lambda}^2(f)$, which is directly comparable to the left-hand side.
- The parameter $\lambda$ controls the bias-variance tradeoff: smaller $\lambda$ moves the criterion toward the $L^2(\mathbb{Q})$-target risk $\mathcal{E}_{0}^2(f) \equiv \mathbb{E}_{\mathbb{Q}}\big[(f(X) - f^*(X))^2\big]$, while the bound pays for this through the source complexity term $\log\mathcal{N}_n$ scaled by $1/\lambda$.
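The role of $\lambda$ in the estimation term of (7) can be made concrete with a few lines of arithmetic. This is a sketch under stated assumptions: the universal constant $c$ is not specified in the theorem, so `c=1.0` below is a placeholder, and the metric-entropy values passed in are hypothetical numbers rather than computed covering numbers.

```python
def estimation_term(log_cover_src, log_cover_tgt, n, m, lam, B, sigma, c=1.0):
    """Estimation-error term of bound (7), up to the unspecified constant c.

    log_cover_src stands for log N_n (entropy of F + B) and log_cover_tgt
    for log N_m (entropy of B); both are supplied by the caller.
    """
    source_part = log_cover_src / (lam * n)  # source complexity, scaled by 1/lam
    target_part = log_cover_tgt / m          # target complexity, lam-free
    return c * (B ** 2 + sigma ** 2) * (source_part + target_part)


# Hypothetical values: shrinking lam inflates only the source part.
for lam in (1.0, 0.1, 0.01):
    print(lam, estimation_term(10.0, 10.0, n=1000, m=1000, lam=lam, B=1.0, sigma=1.0))
```

The target part is unaffected by $\lambda$, so once $\lambda n$ exceeds $m$ the bound is limited by the number of unlabeled target points.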

ReLU network excess risk bounds: [Theorem 2](https://arxiv.org/html/2605.14280#Thmtheorem2) is a general result that applies to any $B$-bounded pair of function classes $(\mathscr{F}, \mathscr{B})$. We can obtain explicit rates of convergence by specializing it to particular classes, and here we describe such a result for sparse clipped ReLU networks. Our analysis converts the general oracle inequality ([7](https://arxiv.org/html/2605.14280#S3.E7)) into an explicit nonparametric rate for this network class.

Let $\mathtt{NN}(L, \mathbf{p}, s, B)$ denote the class of neural networks with $L$ hidden layers, width vector $\mathbf{p}$, at most $s$ nonzero parameters, and final output bounded by $B$; see [Section C.1](https://arxiv.org/html/2605.14280#A3.SS1) in [Appendix C](https://arxiv.org/html/2605.14280#A3) for details. For pairs $(\mathbf{p}_f, s_f)$ and $(\mathbf{p}_g, s_g)$ to be chosen, our theory applies to the pair

$$\text{Main class:}\quad \mathscr{F} = \mathtt{NN}(L_f, \mathbf{p}_f, s_f, B), \qquad \text{Auxiliary class:}\quad \mathscr{B} = \mathtt{NN}(L_g, \mathbf{p}_g, s_g, B). \tag{8a}$$

As for the data-generating model, we impose Hölder smoothness conditions on the true regression function $f^*: \mathbb{R}^d \rightarrow \mathbb{R}$, and the bounded weight $v_{\lambda}(x) = \frac{p(x)}{p(x)+\lambda q(x)}$. In particular, for smoothness parameters $\beta, \gamma > 0$ and a radius $K$, we assume that

$$f^* \in C_d^{\beta}([0,1]^d, K), \quad \text{and} \quad v_{\lambda} = p/(p+\lambda q) \in C_d^{\gamma}([0,1]^d, K), \tag{8b}$$

and moreover that $f^*$ is $K$-bounded with $K \leq B$. See [Section C.1](https://arxiv.org/html/2605.14280#A3.SS1) in [Appendix C](https://arxiv.org/html/2605.14280#A3) for details of the Hölder ball $C_d^{\alpha}([0,1]^d, K)$.

###### Corollary 3 (Excess risk bound for neural net classes).

Under conditions ([8b](https://arxiv.org/html/2605.14280#S3.E8.2)) on the pair $(f^*, v_{\lambda})$, consider the TILT procedure implemented with the neural net classes ([8a](https://arxiv.org/html/2605.14280#S3.E8.1)) with parameters

$$s_f \asymp \big(\kappa_{\lambda} n_{\lambda}\big)^{\frac{d}{2\beta+d}}, \qquad s_g \asymp s_f + \big(\kappa_{\lambda} n_{\lambda}\big)^{\frac{d}{2\gamma+d}}, \qquad L_f \asymp L_g \asymp \log(\kappa_{\lambda} n_{\lambda}), \tag{9a}$$

where $n_{\lambda} := \min\{\lambda n, m\}$, and $\kappa_{\lambda} := (1+\lambda)/\lambda$. Then the TILT estimate $\widehat{f}$ satisfies the following risk bound, with the displayed $\lambda$-dependence entering through $n_{\lambda}$ and $\kappa_{\lambda}$:

$$\mathbb{E}\,\mathcal{E}_{\lambda}^2(\widehat{f}) \leq c_0 (K^2+\sigma^2)\Big\{\kappa_{\lambda}^{\frac{d}{2\beta+d}} n_{\lambda}^{-\frac{2\beta}{2\beta+d}} + \kappa_{\lambda}^{\frac{d}{2\gamma+d}} n_{\lambda}^{-\frac{2\gamma}{2\gamma+d}}\Big\}\,\big(\log(\kappa_{\lambda} n_{\lambda})\big)^3. \tag{9b}$$

Here $c_0$ may depend on $d, \beta, \gamma, B$, but not on $\lambda$, $n$ or $m$.

The effective sample size is $n_\lambda$ because the auxiliary class is controlled by both the source and target covering terms. The source term is the usual approximation-estimation rate for the $\beta$-smooth predictor $f^*$, while the target term is the corresponding rate for the $\gamma$-smooth function $v_\lambda$ controlling the auxiliary correction. A key property is that the result *does not* require the raw density ratio $q/p$ to be bounded or smooth. (Footnote 1: In addition, the $\lambda$-regularized version $q/(p+\lambda q)$ is bounded, but can take values as large as $1/\lambda$, in contrast to the uniform bound $\|v_\lambda\|_\infty\leq 1$.)
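To make the roles of $n_\lambda$ and $\kappa_\lambda$ in the bound (9b) concrete, the following minimal sketch evaluates its $\lambda$-dependent part numerically. The function name `tilt_rate` and the example values of $n$, $m$, $d$, $\beta$, $\gamma$ are illustrative assumptions, and the unspecified prefactor $c_0(K^2+\sigma^2)$ is dropped.

```python
import math

def tilt_rate(n, m, lam, d, beta, gamma):
    """Lambda-dependent part of bound (9b), without the prefactor c_0 (K^2 + sigma^2)."""
    n_lam = min(lam * n, m)        # effective sample size n_lambda = min(lambda * n, m)
    kappa = (1.0 + lam) / lam      # kappa_lambda = (1 + lambda) / lambda
    source = kappa ** (d / (2 * beta + d)) * n_lam ** (-2 * beta / (2 * beta + d))
    target = kappa ** (d / (2 * gamma + d)) * n_lam ** (-2 * gamma / (2 * gamma + d))
    return (source + target) * math.log(kappa * n_lam) ** 3, n_lam, kappa

# Illustrative values only (not taken from the paper's experiments).
rate_small, n_lam, kappa = tilt_rate(n=10_000, m=10_000, lam=1.0, d=1, beta=2.0, gamma=2.0)
rate_large, _, _ = tilt_rate(n=1_000_000, m=1_000_000, lam=1.0, d=1, beta=2.0, gamma=2.0)
```

With $\lambda=1$ the effective sample size is simply $\min\{n,m\}$ and $\kappa_\lambda=2$; shrinking $\lambda$ reduces $n_\lambda$ and inflates $\kappa_\lambda$ at the same time.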

## 4Numerical results

### 4\.1Synthetic regression experiments

The synthetic experiments are designed to isolate the mechanism of TILT in settings where the source and target distributions are known. For a new loss-tilting mechanism, this control is essential: it exposes the behavior of the objective apart from density-ratio estimation error. We use a Beta covariate shift in one dimension and a Gaussian covariate shift in higher dimension, followed by a point-mass nonparametric rate experiment. We compare source ERM, exact importance weighting (IW), exact relative least-squares importance fitting (RuLSIF), kernel-estimated RuLSIF when applicable, and TILT. The exact IW and RuLSIF baselines use the known source and target distributions; no density-ratio estimation or clipping is used for these exact baselines.

Linear misspecification: In the linear experiment, the target predictor $f$ is a degree-three linear model in shifted orthonormal Legendre features, while the auxiliary component $b$ is a Gaussian-kernel ridge expansion on uniform centers. The target distribution is fixed at a Beta$(2,5)$ endpoint and the source distribution is swept from the matched case toward a Beta$(5,2)$ endpoint over 21 corruption levels. For each level we draw $320$ source samples, $320$ target samples, and $12{,}000$ test points, and report means over $100$ trials. All least-squares subproblems are solved by ridge regression, and $\lambda$ is swept on a logarithmic grid from $10^{-6}$ to $10^{4}$. The response contains a degree-three base component plus a localized high-frequency residual, chosen deliberately so that source ERM with the misspecified linear class extrapolates poorly into the target region.

From [Figure 2](https://arxiv.org/html/2605.14280#S4.F2)A, the exact IW degrades, and exact RuLSIF improves over IW after corruption level $0.5$, but TILT remains the best method shown throughout the sweep. Panel B shows the same TILT estimator as a function of $\lambda$ under a larger shift: the best values move to an intermediate regime, visualizing the bias–variance tradeoff induced by the TILT objective. As $\lambda\to\infty$, the target penalty suppresses the auxiliary component $b$, and the estimator approaches the source-ERM reference. For intermediate values of $\lambda$, the target-side correction improves performance without exhibiting the instability suggested by the $1/\lambda$ source-complexity term in the bound.
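For linear classes, the joint fit of $f+b$ on source data with a penalty on $b$ at target inputs reduces to one stacked least-squares problem. The sketch below is a hedged illustration of that reduction, not the authors' implementation: it assumes the empirical objective $\|f+b-y\|_n^2+\lambda\|b\|_m^2$ (the ideal empirical risk of Appendix B with noisy labels in place of $f^*$), uses plain monomials instead of shifted Legendre features, and all names (`fit_tilt_linear`, `poly_features`, `rbf_features`), the toy response, the kernel width, and the small ridge term are assumptions.

```python
import numpy as np

def fit_tilt_linear(Phi_s, Psi_s, Psi_t, y, lam, ridge=1e-8):
    """Jointly fit f = Phi @ theta and b = Psi @ alpha by minimizing
        mean((Phi_s @ theta + Psi_s @ alpha - y)**2)   # source fit of f + b
      + lam * mean((Psi_t @ alpha)**2)                 # target penalty on b
      + ridge * (|theta|^2 + |alpha|^2)                # small stabilizer (assumption)
    through a single stacked least-squares solve."""
    n, p = Phi_s.shape
    m, q = Psi_t.shape
    source_rows = np.hstack([Phi_s, Psi_s]) / np.sqrt(n)
    target_rows = np.hstack([np.zeros((m, p)), np.sqrt(lam / m) * Psi_t])
    ridge_rows = np.sqrt(ridge) * np.eye(p + q)
    A = np.vstack([source_rows, target_rows, ridge_rows])
    rhs = np.concatenate([y / np.sqrt(n), np.zeros(m + p + q)])
    coef, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return coef[:p], coef[p:]

def poly_features(x, degree=3):
    # Monomials 1, x, ..., x^degree (a stand-in for the Legendre features).
    return np.vander(x, degree + 1, increasing=True)

def rbf_features(x, centers, width=0.1):
    # Gaussian bumps on fixed centers for the auxiliary component b.
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

rng = np.random.default_rng(0)
x_src = rng.beta(5, 2, size=320)   # labeled source inputs
x_tgt = rng.beta(2, 5, size=320)   # unlabeled target inputs
truth = lambda x: x**3 - x + 0.5 * np.exp(-200 * (x - 0.8) ** 2)  # toy response
y_src = truth(x_src) + 0.1 * rng.standard_normal(320)

centers = np.linspace(0.0, 1.0, 12)
theta, alpha = fit_tilt_linear(
    poly_features(x_src), rbf_features(x_src, centers),
    rbf_features(x_tgt, centers), y_src, lam=1.0,
)

# Only f is deployed on the target; b is discarded after training.
x_test = rng.beta(2, 5, size=2000)
target_mse = np.mean((poly_features(x_test) @ theta - truth(x_test)) ** 2)
```

Sweeping `lam` over a logarithmic grid with this solver is one way to reproduce the qualitative shape of a $\lambda$-sensitivity curve: very large values push `alpha` toward zero, so `theta` approaches the source-only least-squares fit.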

##### High\-dimensional neural misspecification

We next test a high-dimensional Gaussian covariate-shift problem in which the deployed target class is deliberately misspecified. The target distribution is fixed at $N(0,I_{4096})$, while the source distribution interpolates from the matched case to $N(\mu,I_{4096})$, where $\|\mu\|_2=3$ and the shift lies in a $512$-dimensional coordinate subspace. The regression function contains many target-relevant ridge components together with a source-local residual, so a weak predictor cannot represent the true conditional mean. Across all methods the deployed predictor $f$ is the same one-hidden-layer ReLU network of width $16$, with effective weight decay fixed at $10^{-5}$. TILT uses only an additional auxiliary network $b$ with three hidden layers of width $256$.

For each shift level we use $1024$ labeled source samples, $1024$ unlabeled target samples for the TILT penalty and density-ratio estimation, $128$ labeled target validation samples for model selection, and an independent target test set of size $8192$, averaged over $20$ trials. The validation labels are used only to select hyperparameters. Source ERM and IW select the early-stopping epoch from $\{1100,2200,4400\}$. Exact RuLSIF and kernel-estimated RuLSIF select $\lambda\in\{0.1,1,10,100\}$ and the same early-stopping grid. TILT selects $\lambda\in\{0.1,1,10,100\}$, the auxiliary effective weight decay for $b$ from $\{10^{-7},10^{-6},10^{-5}\}$, and the same early-stopping grid. The architecture and effective weight decay of $f$ are fixed throughout.
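The search spaces described above can be enumerated explicitly, which makes the difference in tuning budget between methods visible. This is a minimal sketch under the assumption that each method is tuned by exhaustive grid search on target-validation MSE; the dictionary keys and the `select` helper are hypothetical names, not taken from the paper.

```python
from itertools import product

lambdas = [0.1, 1, 10, 100]
aux_weight_decays = [1e-7, 1e-6, 1e-5]
stop_epochs = [1100, 2200, 4400]

# One list of candidate configurations per method, following the text.
grids = {
    "source_erm": [{"epoch": e} for e in stop_epochs],
    "iw": [{"epoch": e} for e in stop_epochs],
    "rulsif": [{"lam": l, "epoch": e} for l, e in product(lambdas, stop_epochs)],
    "tilt": [
        {"lam": l, "aux_wd": w, "epoch": e}
        for l, w, e in product(lambdas, aux_weight_decays, stop_epochs)
    ],
}

def select(configs, val_mse):
    """Return the configuration with the smallest target-validation MSE."""
    return min(configs, key=val_mse)

sizes = {name: len(configs) for name, configs in grids.items()}
```

Under this reading, source ERM and IW each choose among 3 configurations, the RuLSIF variants among 12, and TILT among 36.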

[Figure 2](https://arxiv.org/html/2605.14280#S4.F2)C compares the resulting target-test MSEs after target-validation tuning. At zero shift, source ERM, exact IW, and TILT essentially coincide, as they should in the matched distribution. As the shift increases, exact IW deteriorates rapidly because the ordinary density ratio becomes increasingly variable. TILT remains the most stable method across the positive shift levels and gives the clearest gains as the shift grows.


Figure 2: TILT improves target-domain regression under covariate shift. Panel A: In the one-dimensional misspecified linear-regression problem, target-test MSE is reported as the source distribution moves away from the fixed target distribution. Exact importance weighting (IW), exact relative least-squares importance fitting (RuLSIF), and TILT are shown; source ERM is omitted for scale. Panel B: In the same linear setting at corruption level $0.7$, the target-test MSE of TILT is plotted as a function of $\lambda$, with the source ERM reference shown for comparison. Panel C: In the $4096$-dimensional neural problem, target-test MSE is reported as the Gaussian source mean moves away from the fixed target distribution. All methods in C use the same weak ReLU target class and the same effective weight decay for the deployed predictor $f$; a small labeled target validation set is used only to tune allowed hyperparameters. Curves show means over trials and shaded bands show interquartile ranges.

Minimax point-mass lower-bound experiment: Finally, we test whether TILT recovers the effective-sample-size rate predicted by bounded-density-ratio minimax lower bounds for nonparametric covariate shift (pathak2022new). We take $\mathbb{Q}_X=\mathrm{Unif}[0,1]$ and $\mathbb{P}_L=L^{-1}\mathrm{Unif}[0,1]+(1-L^{-1})\delta_0$, so the bound $\|dq/dp_L\|_\infty=L$ is tight. This is a worst-case shift for the sine-feature model because all sine features vanish at $x=0$, i.e., $f(0)=0$, making atom samples uninformative about the target function. The true regression function is

$$
f^*(x)=\sum_{k\geq 1}\theta_k\sqrt{2}\sin(\pi kx),\qquad\text{with}\quad\theta_k\asymp(-1)^{k-1}k^{-(\beta+1/2)},
$$

with $\beta=2$. Source responses are observed with additive Gaussian noise, $Y=f^*(X)+\xi$ with $\xi\sim N(0,0.2^2)$, and we report target MSE against the noiseless regression function $f^*$. We use the first $D_F\asymp(n/L)^{1/(2\beta+1)}$ sine features for $\mathscr{F}$, a richer sine span for $\mathscr{B}\supseteq\mathscr{F}$, and tune both $\mathscr{B}$ and $\lambda$ oracle-wise. In this case, given $d=1$ and $\beta=2.0$, the minimax target-MSE rate is $(n/L)^{-\frac{2\beta}{2\beta+1}}=(n/L)^{-4/5}$ (MaPatWai23).

![Refer to caption](https://arxiv.org/html/2605.14280v1/x3.png)

Figure 3: Point-mass nonparametric rate. Left: oracle-tuned TILT under $\mathbb{P}_L=L^{-1}\mathrm{Unif}[0,1]+(1-L^{-1})\delta_0$ for a $\beta=2$ sine-series regression function. Right: source ERM on the same $n/L$ axis ascends as $L$ increases. The dotted line follows the theoretical rate $(n/L)^{-4/5}$.

[Figure 3](https://arxiv.org/html/2605.14280#S4.F3) compares TILT with source ERM on the same $n/L$ axis. TILT aligns with the $(n/L)^{-4/5}$ theory slope, while source ERM does not align because it scales as $Ln^{-4/5}$ in this construction. Since $L^{4/5}<L$ for $L>1$, TILT's minimax scaling is better than the linear dependence on $L$ of source ERM in the shifted regimes, showing that the TILT objective can effectively mitigate the worst-case effect of the large density ratio in this problem.
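The point-mass construction and the two scalings compared above can be illustrated with a short simulation. This is a hedged sketch: the sampler, the feature helper, and the values $n=4000$, $L=8$ are assumptions chosen for illustration, and the two rates are the ones quoted in the text rather than anything measured here.

```python
import numpy as np

def sample_source(n, L, rng):
    """Draw n points from P_L = (1/L) Unif[0,1] + (1 - 1/L) delta_0."""
    from_uniform = rng.random(n) < 1.0 / L
    return np.where(from_uniform, rng.random(n), 0.0)

def sine_features(x, D):
    """First D features sqrt(2) * sin(pi * k * x), for k = 1, ..., D."""
    k = np.arange(1, D + 1)
    return np.sqrt(2.0) * np.sin(np.pi * k[None, :] * x[:, None])

rng = np.random.default_rng(0)
n, L = 4000, 8
x = sample_source(n, L, rng)

informative = np.count_nonzero(x)              # samples off the atom, roughly n / L
atom_rows = sine_features(x[x == 0.0], D=5)    # all-zero rows: the atom carries no signal

rate_tilt = (n / L) ** (-4 / 5)   # minimax scaling quoted in the text
rate_erm = L * n ** (-4 / 5)      # source-ERM scaling quoted in the text
gap = rate_erm / rate_tilt        # algebraically equal to L ** (1 / 5)
```

The ratio of the two quoted rates is $L^{1/5}$, so the separation between the curves grows slowly but without bound as the density-ratio bound $L$ increases.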

### 4\.2Covariate shift in CIFAR\-100

##### Separation between TILT and KD

For CIFAR-100 we use a target-side image corruption intended to mimic poor acquisition conditions: as the target shift grows, images become darker, more contrasted, less saturated, and increasingly blurred as shown in [Figure 4](https://arxiv.org/html/2605.14280#S4.F4)A. The source domain is clean CIFAR-100. A ResNet-20 teacher is trained from scratch on the source, and the tilted variants use a ResNet-20 auxiliary logit model $b$, also without pretraining. The deployed target/student model $f$ is an intentionally weak CNN with about $1.4\times 10^{4}$ parameters, whereas ResNet-20 has about $2.8\times 10^{5}$ parameters, roughly a $20$-fold gap. This makes the student misspecified enough that covariate shift has a visible effect. Because $b$ is added inside the logits, we mean-center the auxiliary logits across classes so that their class-wise sum is zero and the decomposition does not collapse through an arbitrary logit shift.
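The mean-centering step described above is simple to state in code. The following minimal NumPy sketch assumes logits are stored as arrays of shape `(batch, classes)`; the function name `combined_logits` and the random inputs are illustrative assumptions.

```python
import numpy as np

def combined_logits(f_logits, b_logits):
    """Source-side logits f + b, with b mean-centered across classes."""
    b_centered = b_logits - b_logits.mean(axis=-1, keepdims=True)
    return f_logits + b_centered, b_centered

rng = np.random.default_rng(0)
f_logits = rng.standard_normal((4, 100))   # student logits: batch of 4, 100 classes
b_logits = rng.standard_normal((4, 100))   # auxiliary logits

total, b_centered = combined_logits(f_logits, b_logits)
# Adding a constant to every auxiliary logit leaves the combined logits unchanged.
shifted_total, _ = combined_logits(f_logits, b_logits + 3.7)
```

Since softmax is invariant to a constant shift of the logits, an uncentered $b$ could drift by an arbitrary per-example constant without changing the source fit; centering removes that degree of freedom.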

We compare source ERM, vanilla knowledge distillation (KD) (hinton2015distilling), and the two tilted variants KD-TILT and KL-TILT. The teacher is trained for $160$ epochs and students for $100$ epochs using SGD with momentum; distillation uses temperature $T=2.0$ and mixing weight $\beta=0.5$. Target shift strength is swept over $\{0,0.33,0.66,1.0,1.33,1.66\}$, omitting the intermediate $0.83$ run. The method-comparison curves aggregate ten seeds, and the $\lambda$-sensitivity curve aggregates seven KL-TILT seeds. The $\lambda$-sensitivity panel uses separate KL-TILT runs at shift strengths $0$ and $1.5$. The penalty strength $\lambda$ is selected per shift strength using labeled target validation data, from the swept grids for KD-TILT and KL-TILT.
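For reference, the vanilla KD baseline with temperature $T$ and mixing weight $\beta$ is commonly written as a convex combination of a hard-label cross-entropy and a temperature-softened KL term. The sketch below follows that standard formulation, including the conventional $T^2$ scaling; the text does not state which term $\beta$ multiplies or whether $T^2$ scaling is used, so both are assumptions (with $\beta=0.5$ the first choice is immaterial). The names `log_softmax` and `kd_loss` are illustrative.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kd_loss(student_logits, teacher_logits, labels, T=2.0, beta=0.5):
    """beta * CE(student, labels) + (1 - beta) * T^2 * KL(teacher_T || student_T)."""
    rows = np.arange(len(labels))
    ce = -log_softmax(student_logits)[rows, labels].mean()
    log_p_teacher = log_softmax(teacher_logits / T)
    log_p_student = log_softmax(student_logits / T)
    kl = (np.exp(log_p_teacher) * (log_p_teacher - log_p_student)).sum(axis=-1).mean()
    return beta * ce + (1.0 - beta) * T**2 * kl

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
labels = np.array([0, 2])
loss_matched = kd_loss(logits, logits, labels)             # KL term vanishes
loss_mismatch = kd_loss(logits, logits[:, ::-1], labels)   # teacher disagrees with student
```

When student and teacher logits coincide the KL term is zero and only the weighted cross-entropy remains, which gives a quick sanity check on any implementation.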

[Figure 4](https://arxiv.org/html/2605.14280#S4.F4) shows that a small amount of corruption is enough for the tilted distillation objectives to separate from the baselines. In [Figure 4](https://arxiv.org/html/2605.14280#S4.F4)B, KD-TILT and KL-TILT are close to the baselines at small shifts but overtake source ERM and vanilla KD once the shift strength becomes moderate. The target-test cross-entropy in [Figure 4](https://arxiv.org/html/2605.14280#S4.F4)C makes the same effect more pronounced. At zero shift, KL-TILT pays a small cost relative to source ERM and vanilla KD, while KD-TILT remains comparable, as expected when little target-side debiasing is needed. As the shift grows, both KD-TILT and KL-TILT obtain much lower loss than source ERM and vanilla KD. KD-TILT is consistently slightly better than KL-TILT, but KL-TILT is also stable and substantially improves over the non-tilted baselines. These trends suggest that target-side debiasing can compensate for a large capacity gap even when the auxiliary model has the same architecture as the teacher.

[Figure 4](https://arxiv.org/html/2605.14280#S4.F4)D plots the target-test cross-entropy of KL-TILT as a function of $\lambda$ at small and large target shifts. The finite curves are relatively flat across a broad intermediate range of $\lambda$ in both regimes, showing that the method is not sensitive to precise tuning within that range. At the same time, performance degrades or becomes numerically unstable when $\lambda$ is made extremely small or extremely large, matching the synthetic experiments: useful target-side debiasing requires an intermediate amount of regularization rather than either pure importance weighting or the source-ERM limit. The dashed source-ERM references show that, under the larger shift, this robust range remains well below the source-only loss.


Figure 4: KD-TILT and KL-TILT consistently improve shifted-test performance over source-only baselines on CIFAR-100. A ResNet-20 teacher is trained on clean CIFAR-100 (source), and a weak CNN student is distilled under covariate shift induced by a deterministic combination of brightness, contrast, color, and Gaussian-blur perturbations on the target domain, parameterized by the target shift strength on the horizontal axis. Panel A shows the deterministic target-shift path for one representative CIFAR-100 test image with the label “rocket”; the leftmost image is the clean source image at shift strength $0$. Panel B reports top-1 accuracy on the shifted target-test set; Panel C reports the corresponding target-test cross-entropy loss. Panel D shows the target-test cross-entropy of KL-TILT as a function of $\lambda$ at shift strengths $0$ and $1.5$; dashed horizontal lines mark the corresponding source-ERM losses, and missing large-$\lambda$ points indicate NaN training runs. We compare Source ERM (student trained with cross-entropy on source only), KD (vanilla knowledge distillation), KD-TILT (KD with the TILT target-side regularizer), and KL-TILT (the KL-divergence tilted variant). The penalty strength $\lambda$ for KD-TILT and KL-TILT is selected per shift strength on a labeled target-validation split. Solid curves show means over seeds and shaded bands indicate $\pm 1$ standard deviation; B and C use ten seeds, while D uses seven seeds. Both TILT variants clearly dominate Source ERM and KD at moderate-to-high shift strengths, and the KL-TILT loss curve remains robust over a broad intermediate range of $\lambda$ values.

## 5Discussion

We introduced TILT, a one-step method for unsupervised covariate-shift adaptation that uses unlabeled target covariates to regularize an auxiliary component rather than to estimate an explicit density ratio. The main conceptual point is that a simple additive decomposition, together with a target-side penalty on the auxiliary term, induces a profiled objective equivalent to a relative weighted target excess risk. Instead of learning a potentially unstable ratio, the auxiliary component learns a bounded, error-localized correction.

Several directions for future work remain. First, the current finite-sample theory applies to least-squares regression under covariate shift, with bounded function classes and sub-Gaussian noise. The classification variants follow the same algorithmic principle, but extending the finite-sample guarantees to general Bregman losses remains to be done. Second, the method introduces an auxiliary model and at least one regularization parameter. Although TILT avoids explicit density-ratio estimation, it does not eliminate model-selection issues, including $\lambda$ and the function classes $(\mathscr{F},\mathscr{B})$. Finally, the empirical evaluation is still controlled. The synthetic experiments isolate the mechanism, and the CIFAR-100 shift is generated by deterministic image corruptions. These experiments show that TILT can work under substantial covariate shift, but broader evaluation on natural distribution shifts is needed to characterize its practical failure modes.

#### Acknowledgements

This work was partially funded by National Science Foundation Grant NSF DMS-2311072, Office of Naval Research ONR Grant N00014026-1-2116, and the Ford Professorship to MJW. KY was supported by the Takenaka Scholarship Foundation.


Appendix for “TILT: Target\-induced loss tilting under covariate shift”

## Appendix A Proof of [Section 2.1](https://arxiv.org/html/2605.14280#S2.SS1)

Here we prove the more general decomposition result

$$
\mathcal{A}(f,b)=\lambda\mathcal{E}_\lambda^2(f)+(1+\lambda)\|b-b^*_f\|_{\mathbb{P}_\lambda}^2, \tag{10}
$$

where $\mathbb{P}_\lambda$ is the distribution with density $\frac{p+\lambda q}{1+\lambda}$. Both claims in [Section 2.1](https://arxiv.org/html/2605.14280#S2.SS1) follow as a direct consequence of this representation.

By definition, the auxiliary risk is given by

$$
\mathcal{A}(f,b)=\mathbb{E}_{\mathbb{P}}\big[(f(X)-f^*(X)+b(X))^2\big]+\lambda\mathbb{E}_{\mathbb{Q}}[b^2(X)]\;=\;\mathbb{E}_{\mathbb{P}}\big[(h(X)+b(X))^2\big]+\lambda\mathbb{E}_{\mathbb{Q}}[b^2(X)],
$$

where we have introduced the shorthand notation $h(x)\coloneqq f(x)-f^*(x)$. Next we write the expectations in terms of the densities $p$ and $q$, and expand the square, thereby obtaining

$$
\mathcal{A}(f,b)=\int\underbrace{\left(h^2(x)p(x)+2h(x)b(x)p(x)+b^2(x)(p(x)+\lambda q(x))\right)}_{\equiv H(x)}dx.
$$

We complete the square with respect to $b(x)$, weighted by $p(x)+\lambda q(x)$. By doing so, the integrand $H$ can be decomposed as $H(x)=H_1(x)+H_2(x)$, where

$$
\begin{aligned}
H_1(x)&\coloneqq\left(p(x)-\frac{p^2(x)}{p(x)+\lambda q(x)}\right)h^2(x),\quad\text{and}\\
H_2(x)&\coloneqq(p(x)+\lambda q(x))\left(b(x)+\frac{p(x)}{p(x)+\lambda q(x)}h(x)\right)^2.
\end{aligned}
$$

By inspection, we see that $\int H_2(x)dx$ is equal to $(1+\lambda)\|b-b^*_f\|_{\mathbb{P}_\lambda}^2$. As for the term $H_1$, we have

$$
\begin{aligned}
H_1(x)=\left(p(x)-\frac{p^2(x)}{p(x)+\lambda q(x)}\right)h^2(x)
&=\left(\frac{p(x)(p(x)+\lambda q(x))-p^2(x)}{p(x)+\lambda q(x)}\right)h^2(x)\\
&=\left(\frac{\lambda p(x)q(x)}{p(x)+\lambda q(x)}\right)h^2(x)=\lambda v_\lambda(x)h^2(x)q(x).
\end{aligned}
$$

Consequently, we have $\int H_1(x)dx=\lambda\mathbb{E}_{\mathbb{Q}}[v_\lambda(X)h^2(X)]=\lambda\mathcal{E}_\lambda^2(f)$, which completes the proof.
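The decomposition (10) can be sanity-checked numerically by replacing the densities with probability mass functions on a finite grid. The sketch below is an illustration under that discretization assumption; it reads the minimizer $b^*_f=-v_\lambda h$ off the completed square in $H_2$, and the grid size, $\lambda$, and random functions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 50, 0.7

p = rng.random(N)
p /= p.sum()                 # source pmf (strictly positive)
q = rng.random(N)
q[:10] = 0.0                 # target puts no mass on part of the source support
q /= q.sum()                 # target pmf

h = rng.standard_normal(N)   # h = f - f*
b = rng.standard_normal(N)   # arbitrary auxiliary function

v = p / (p + lam * q)        # bounded weight v_lambda, always in (0, 1]
b_star = -v * h              # minimizer b*_f read off from H_2

lhs = np.sum(p * (h + b) ** 2) + lam * np.sum(q * b ** 2)     # A(f, b)
excess = np.sum(q * v * h ** 2)                               # E_lambda^2(f) = E_Q[v h^2]
residual = np.sum((p + lam * q) * (b - b_star) ** 2)          # (1 + lam) * ||b - b*_f||^2 under P_lambda
rhs = lam * excess + residual
```

The identity holds for any choice of `h` and `b`, and `v` stays bounded by one even where `q` vanishes, which is the boundedness property emphasized in the main text.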

## Appendix B Proof of [Theorem 2](https://arxiv.org/html/2605.14280#Thmtheorem2)

We now turn to the proof of [Theorem 2](https://arxiv.org/html/2605.14280#Thmtheorem2). For the reader’s convenience, we summarize some notation and then provide an overall roadmap of the proof structure.

##### Notation:

Throughout the appendix proofs, we use $\mathbb{E}$ to denote expectation over all covariates (both source and target), as well as over source-response noise. For notational compactness, we write $\mathbb{E}\,\mathcal{A}$ and $\mathbb{E}\,\widehat{\mathcal{A}}$ for the corresponding expected population and ideal empirical risks. The ideal empirical risk can be decomposed as

$$
\widehat{\mathcal{A}}(f,b)\coloneqq\widehat{\mathcal{A}}_S(f,b)+\lambda\widehat{\mathcal{A}}_T(b),\qquad\widehat{\mathcal{A}}_S(f,b)\coloneqq\|f+b-f^*\|_n^2,\qquad\widehat{\mathcal{A}}_T(b)\coloneqq\|b\|_m^2.
$$

##### Roadmap for [Theorem 2](https://arxiv.org/html/2605.14280#Thmtheorem2)

The proof proceeds in four steps. First, by using the decomposition ([10](https://arxiv.org/html/2605.14280#A1.E10)) that underlies the proof of [Section 2.1](https://arxiv.org/html/2605.14280#S2.SS1), we convert the population TILT objective into the reweighted excess risk plus a nonnegative residual term. Second, [Lemma 4](https://arxiv.org/html/2605.14280#A2.SS1) transfers control from the population risk to the empirical idealized risk. Third, [Lemma 5](https://arxiv.org/html/2605.14280#A2.SS1) relates this idealized quantity to the actual empirical minimizer trained with noisy labels. Finally, these three ingredients are combined into the main stochastic inequality, and the oracle term is rewritten using the explicit minimizer over the auxiliary function.

At a high level, the proof combines the risk decomposition, generalization control, and noise\-interaction control as follows:

$$
\begin{aligned}
\lambda\mathbb{E}[\mathcal{E}_\lambda^2(\widehat{f})]
&\underbrace{\leq}_{\text{equation (10)}}\mathbb{E}\,\mathcal{A}(\widehat{f},\widehat{b})\\
&\underbrace{\lesssim}_{\text{Lemma 4}}\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})+\text{Gen. Error}\\
&\underbrace{\lesssim}_{\text{Lemma 5}}\inf_{f\in\mathscr{F},b\in\mathscr{B}}\mathcal{A}(f,b)+\text{Est. Error}+\text{Gen. Error}.
\end{aligned}
$$

Here the generalization error and estimation error are made explicit below in terms of the covering numbers of the function class $\mathscr{F}+\mathscr{B}$ on the source samples, and the function class $\mathscr{B}$ on the target samples.

### B\.1Main argument

By the Risk Decomposition Identity ([10](https://arxiv.org/html/2605.14280#A1.E10)), for any pair $(f,b)$, we have the upper bound $\lambda\mathcal{E}_\lambda^2(f)\leq\mathcal{A}(f,b)$. Applying this fact to the pair $(\widehat{f},\widehat{b})$ returned by the TILT procedure yields

$$
\lambda\mathbb{E}[\mathcal{E}_\lambda^2(\widehat{f})]\leq\mathbb{E}\,\mathcal{A}(\widehat{f},\widehat{b}). \tag{11}
$$

We next need to compare the expected population risk ([11](https://arxiv.org/html/2605.14280#A2.E11)) with the ideal empirical risk associated with the samples. The following lemma provides such a generalization bound:

###### Lemma 4 (Generalization bound).

Under the conditions of [Theorem 2](https://arxiv.org/html/2605.14280#Thmtheorem2), for any $\epsilon_1\in(0,1]$, the learned estimators $(\widehat{f},\widehat{b})$ satisfy

$$
\mathbb{E}\,\mathcal{A}(\widehat{f},\widehat{b})\leq(1+\epsilon_1)\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})+\frac{cB^2}{\epsilon_1}\left(\frac{\log\mathcal{N}_n(\delta;\mathscr{F}+\mathscr{B})}{n}+\lambda\frac{\log\mathcal{N}_m(\delta;\mathscr{B})}{m}\right)+210\delta B, \tag{12}
$$

where $c$ is a universal constant.

See[Section˜B\.2](https://arxiv.org/html/2605.14280#A2.SS2)for the proof\.

Our next step is to control the expected empirical idealized risk $\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})$. The estimator is fitted with noisy labels, so the next lemma isolates the resulting noise interaction and converts the empirical risk into an oracle population quantity. Let $\mathcal{N}_n(\delta;\mathscr{F}+\mathscr{B})$ be the $\delta$-covering number of the sum class $\mathcal{H}=\mathscr{F}+\mathscr{B}$ with respect to the empirical $L_\infty$ norm on the source data.

###### Lemma 5 (Noise interaction bound).

For any $\delta>0$ and $\epsilon_2\in(0,1]$, the learned estimator $(\widehat{f},\widehat{b})$ satisfies

$$
\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})\leq(1+\epsilon_2)\inf_{f\in\mathscr{F},b\in\mathscr{B}}\mathcal{A}(f,b)+\frac{(1+\epsilon_2)^2}{\epsilon_2}\frac{12\sigma^2\log\mathcal{N}_n(\delta;\mathscr{F}+\mathscr{B})}{n}+c_\delta, \tag{13}
$$

where $c_\delta=6(1+\epsilon_2)\sigma\delta$.

See[Section˜B\.3](https://arxiv.org/html/2605.14280#A2.SS3)for the proof\.

Combining the bound \([13](https://arxiv.org/html/2605.14280#A2.E13)\) with the bound \([12](https://arxiv.org/html/2605.14280#A2.E12)\) yields

$$
\begin{aligned}
\mathbb{E}\,\mathcal{A}(\widehat{f},\widehat{b})&\leq(1+\epsilon_1)(1+\epsilon_2)\inf_{f,b}\mathcal{A}(f,b)+(1+\epsilon_1)\frac{(1+\epsilon_2)^2}{\epsilon_2}\frac{12\sigma^2\log\mathcal{N}_n}{n}\\
&\quad+\frac{c_1B^2}{\epsilon_1}\left(\frac{\log\mathcal{N}_n+1}{n}+\frac{\lambda(\log\mathcal{N}_m+1)}{m}\right)+\operatorname{Rem}(\delta),
\end{aligned}
$$

where $\operatorname{Rem}(\delta)\coloneqq 210\delta B+6(1+\epsilon_2)\sigma\delta$.

We simplify the coefficients by setting $\epsilon_1=\epsilon_2=\epsilon/3$ for some $\epsilon\in(0,1]$. Then $(1+\epsilon_1)(1+\epsilon_2)=1+\epsilon_1+\epsilon_2+\epsilon_1\epsilon_2\leq 1+\epsilon$. The terms proportional to $1/\epsilon_1$ and $1/\epsilon_2$ are both $O(1/\epsilon)$. Combining the source complexity terms from both lemmas, which scale with $B^2/n$ and $\sigma^2/n$, and the target complexity term, which scales with $B^2/m$, we can find a sufficiently large constant $c$ such that, using the convention $\mathcal{N}_n,\mathcal{N}_m\geq 2$ to absorb additive constants into the logarithms,

$$
\mathbb{E}\,\mathcal{A}(\widehat{f},\widehat{b})\leq(1+\epsilon)\inf_{f,b}\mathcal{A}(f,b)+\frac{c(B^2+\sigma^2)}{\epsilon}\left(\frac{\log\mathcal{N}_n}{n}+\frac{\lambda\log\mathcal{N}_m}{m}\right)+\operatorname{Rem}(\delta).
$$

The decomposition of the infimum term follows directly from equation ([10](https://arxiv.org/html/2605.14280#A1.E10)), in particular by substituting the definition of $\mathcal{A}(f,b)$ and minimizing over $b$ for a fixed $f$. This completes the proof of the theorem.
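As a check on the coefficient simplification used in this argument, the arithmetic for $\epsilon_1=\epsilon_2=\epsilon/3$ with $\epsilon\in(0,1]$ (so that $\epsilon^2\leq\epsilon$) can be written out explicitly. The numerical constants below are one valid choice obtained from this calculation, not constants stated in the original proof:

$$
(1+\epsilon_1)(1+\epsilon_2)=\Big(1+\frac{\epsilon}{3}\Big)^2=1+\frac{2\epsilon}{3}+\frac{\epsilon^2}{9}\leq 1+\frac{2\epsilon}{3}+\frac{\epsilon}{9}=1+\frac{7\epsilon}{9}\leq 1+\epsilon,
$$

$$
\frac{1}{\epsilon_1}=\frac{3}{\epsilon},\qquad(1+\epsilon_1)\frac{(1+\epsilon_2)^2}{\epsilon_2}\leq\Big(\frac{4}{3}\Big)^3\cdot\frac{3}{\epsilon}=\frac{64}{9\epsilon}.
$$

Both complexity coefficients are therefore bounded by a universal constant times $1/\epsilon$, which is what allows them to be merged into the single factor $c(B^2+\sigma^2)/\epsilon$.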

### B.2 Proof of [Lemma 4](https://arxiv.org/html/2605.14280#A2.SS1)

The excess auxiliary risk can be decomposed as $\mathcal{A}(f,b)=\mathcal{A}_S(f,b)+\lambda\mathcal{A}_T(b)$, where $\mathcal{A}_S(f,b)\coloneqq\|f+b-f^*\|_{\mathbb{P}}^2$ and $\mathcal{A}_T(b)\coloneqq\|b\|_{\mathbb{Q}}^2$. Using this notation, we have the decomposition

$$
\left|\mathbb{E}\,\mathcal{A}(\widehat{f},\widehat{b})-\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})\right|\leq\underbrace{\left|\mathbb{E}\,\mathcal{A}_S(\widehat{f},\widehat{b})-\mathbb{E}\,\widehat{\mathcal{A}}_S(\widehat{f},\widehat{b})\right|}_{\equiv T_1}+\lambda\underbrace{\left|\mathbb{E}\,\mathcal{A}_T(\widehat{b})-\mathbb{E}\,\widehat{\mathcal{A}}_T(\widehat{b})\right|}_{\equiv T_2}.
$$

Note that functions of the form $f+b-f^*$ are $3B$-uniformly bounded, whereas the function $b$ is $B$-uniformly bounded. Thus, we can apply [Section C.4](https://arxiv.org/html/2605.14280#A3.SS4) to control each term.

Beginning with $T_1$, setting $B_{\mathcal{H}}=3B$ yields

$$
T_1\leq\frac{3B}{\sqrt{n}}\sqrt{\mathbb{E}\,\mathcal{A}_S(\widehat{f},\widehat{b})}\sqrt{36\log\mathcal{N}_n+256}+9B^2\frac{6\log\mathcal{N}_n+11}{n}+78\delta B. \tag{14a}
$$

As for $T_2$, setting $B_{\mathcal{H}}=B$ yields

$$
T_2\leq\frac{B}{\sqrt{m}}\sqrt{\mathbb{E}\,\mathcal{A}_T(\widehat{b})}\sqrt{36\log\mathcal{N}_m+256}+B^2\frac{6\log\mathcal{N}_m+11}{m}+26\delta B. \tag{14b}
$$

In order to combine these two bounds, we use the Cauchy–Schwarz inequality in the form

$$
\sqrt{u}\,s+\sqrt{\lambda v}\,(\sqrt{\lambda}\,t)\leq\sqrt{u+\lambda v}\,\sqrt{s^2+\lambda t^2}. \tag{14c}
$$

With the choices $u=\mathbb{E}\,\mathcal{A}_S(\widehat{f},\widehat{b})$ and $v=\mathbb{E}\,\mathcal{A}_T(\widehat{b})$, we have $u+\lambda v=\mathbb{E}\,\mathcal{A}(\widehat{f},\widehat{b})$. We then set

$$
s^2=\frac{9B^2(36\log\mathcal{N}_n+256)}{n}\qquad\text{and}\qquad t^2=\frac{B^2(36\log\mathcal{N}_m+256)}{m}.
$$

Applying inequality ([14c](https://arxiv.org/html/2605.14280#A2.E14.3)) and combining with the bounds ([14a](https://arxiv.org/html/2605.14280#A2.E14.1)) and ([14b](https://arxiv.org/html/2605.14280#A2.E14.2)) then yields

$$
\begin{aligned}
\left|\mathbb{E}\,\mathcal{A}(\widehat{f},\widehat{b})-\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})\right|
&\leq\sqrt{\mathbb{E}\,\mathcal{A}(\widehat{f},\widehat{b})}\sqrt{\frac{9B^{2}(36\log\mathcal{N}_{n}+256)}{n}+\lambda\frac{B^{2}(36\log\mathcal{N}_{m}+256)}{m}}\\
&\quad+9B^{2}\frac{6\log\mathcal{N}_{n}+11}{n}+\lambda B^{2}\frac{6\log\mathcal{N}_{m}+11}{m}+104\delta B.
\end{aligned}
$$

We can write this inequality more compactly in the form $|s-t|\leq 2\sqrt{su}+v$, where $s\coloneqq\mathbb{E}\,\mathcal{A}(\widehat{f},\widehat{b})$, $t\coloneqq\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})$, and

$$
\begin{aligned}
u&\coloneqq\frac{1}{4}\left(\frac{9B^{2}(36\log\mathcal{N}_{n}+256)}{n}+\lambda\frac{B^{2}(36\log\mathcal{N}_{m}+256)}{m}\right),\\
v&\coloneqq 9B^{2}\frac{6\log\mathcal{N}_{n}+11}{n}+\lambda B^{2}\frac{6\log\mathcal{N}_{m}+11}{m}+104\delta B.
\end{aligned}
$$

Now our inequality implies that $s\leq t+2\sqrt{su}+v\leq t+\eta s+\frac{u}{\eta}+v$, where the second inequality follows by Young's inequality, valid for any $\eta\in(0,1)$. Re-arranging yields

$$
s\leq\frac{t}{1-\eta}+\frac{v}{1-\eta}+\frac{u}{\eta\,(1-\eta)}.
$$

We then choose $\eta=\frac{\epsilon}{1+\epsilon}$ for some $\epsilon\in(0,1)$, so that $\frac{1}{1-\eta}=1+\epsilon$, thereby obtaining the bound

$$
s\leq(1+\epsilon)t+(1+\epsilon)v+\frac{(1+\epsilon)^{2}}{\epsilon}u\;\leq\;(1+\epsilon)t+2v+\frac{4}{\epsilon}u,
$$

where the second step follows since $1+\epsilon\leq 2$. Recalling our definitions of $(u,v)$, we have

$$
\begin{aligned}
2v+\frac{4}{\epsilon}u&\leq 18B^{2}\frac{6\log\mathcal{N}_{n}+11}{n}+2\lambda B^{2}\frac{6\log\mathcal{N}_{m}+11}{m}+208\delta B\\
&\quad+\frac{1}{\epsilon}\left(\frac{9B^{2}(36\log\mathcal{N}_{n}+256)}{n}+\lambda\frac{B^{2}(36\log\mathcal{N}_{m}+256)}{m}\right).
\end{aligned}
$$

Since $1/\epsilon\geq 1$, we have shown that

$$
\mathbb{E}\,\mathcal{A}(\widehat{f},\widehat{b})\leq(1+\epsilon)\,\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})+\frac{B^{2}}{\epsilon}\left(\frac{c_{1}\log\mathcal{N}_{n}+c_{2}}{n}+\lambda\frac{c_{3}\log\mathcal{N}_{m}+c_{4}}{m}\right)+208\delta B.
$$

The asserted bound follows after enlarging the universal constants.

### B.3 Proof of [Section B.1](https://arxiv.org/html/2605.14280#A2.SS1)

We begin by relating the empirical objective minimized by the estimator to the ideal risk. Recall that our estimator $(\widehat{f},\widehat{b})$ minimizes the empirical objective $\widehat{\mathcal{L}}(f,b)$. The source component involves the pairs $(x_{i},y_{i})$, and we can write $y_{i}=f^{*}(x_{i})+\xi_{i}$, where $f^{*}(x)=\mathbb{E}[Y\mid X=x]$, and $\xi_{i}$ is conditionally zero-mean noise.

Let $(f,b)$ be any fixed pair of functions in $\mathscr{F}\times\mathscr{B}$. By the optimality of $(\widehat{f},\widehat{b})$, we have the basic inequality $\widehat{\mathcal{L}}(\widehat{f},\widehat{b})\leq\widehat{\mathcal{L}}(f,b)$. We now expand the squared loss term in $\widehat{\mathcal{L}}$. For any function $h$, the empirical source loss is given by

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}(h(x_{i})-y_{i})^{2}&=\frac{1}{n}\sum_{i=1}^{n}(h(x_{i})-f^{*}(x_{i})-\xi_{i})^{2}\\
&=\frac{1}{n}\sum_{i=1}^{n}(h(x_{i})-f^{*}(x_{i}))^{2}-\frac{2}{n}\sum_{i=1}^{n}\xi_{i}(h(x_{i})-f^{*}(x_{i}))+\frac{1}{n}\sum_{i=1}^{n}\xi_{i}^{2}.
\end{aligned}
$$

Substituting this expansion into the inequality $\widehat{\mathcal{L}}(\widehat{f},\widehat{b})\leq\widehat{\mathcal{L}}(f,b)$ and re-arranging, we find that

$$
\|\widehat{f}+\widehat{b}-f^{*}\|_{n}^{2}+\lambda\|\widehat{b}\|_{m}^{2}\leq\|f+b-f^{*}\|_{n}^{2}+\lambda\|b\|_{m}^{2}+\frac{2}{n}\sum_{i=1}^{n}\xi_{i}\big(\widehat{f}(x_{i})-f(x_{i})+\widehat{b}(x_{i})-b(x_{i})\big).
$$

We now take the expectation over the dataset realizations. Recall that $\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})=\mathbb{E}[\|\widehat{f}+\widehat{b}-f^{*}\|_{n}^{2}+\lambda\|\widehat{b}\|_{m}^{2}]$. For the fixed functions $f,b$ on the right-hand side, the expectation of the noise interaction term is zero because $\mathbb{E}[\xi_{i}\mid x_{i}]=0$ and $f,b$ are independent of $\xi$; the same holds for the fixed function $f^{*}$, so we may replace $f+b$ by $f^{*}$ inside the noise term without changing its expectation. Furthermore, $\mathbb{E}[\|f+b-f^{*}\|_{n}^{2}+\lambda\|b\|_{m}^{2}]=\mathcal{A}(f,b)$. Using these facts, we arrive at the inequality

$$
\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})\leq\mathcal{A}(f,b)+\mathbb{E}\left[\frac{2}{n}\sum_{i=1}^{n}\xi_{i}\big(\widehat{f}(x_{i})+\widehat{b}(x_{i})-f^{*}(x_{i})\big)\right]. \tag{15}
$$

The core difficulty lies in bounding the expected noise interaction term on the right-hand side, as $\widehat{f}$ and $\widehat{b}$ depend on the noise $\xi$. We do so using a covering number argument.
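
The expansion of the empirical source loss used above is an exact algebraic identity, which the following sketch verifies on simulated data. Everything here is illustrative: the function name `source_loss_terms`, the choice of `math.sin` as a stand-in for $f^{*}$, and the Gaussian noise are assumptions made only for this example.

```python
import math
import random


def source_loss_terms(h, f_star, xs, noise):
    """Return (loss, approximation, cross, noise_energy) for responses y = f_star(x) + noise."""
    n = len(xs)
    ys = [f_star(x) + xi for x, xi in zip(xs, noise)]
    # Left-hand side: empirical squared loss of h against the noisy responses.
    loss = sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / n
    # Right-hand side pieces: squared empirical norm, noise interaction, noise energy.
    approximation = sum((h(x) - f_star(x)) ** 2 for x in xs) / n
    cross = 2.0 / n * sum(xi * (h(x) - f_star(x)) for x, xi in zip(xs, noise))
    noise_energy = sum(xi ** 2 for xi in noise) / n
    return loss, approximation, cross, noise_energy


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(100)]
    noise = [rng.gauss(0.0, 0.3) for _ in xs]
    loss, approximation, cross, noise_energy = source_loss_terms(
        lambda x: 0.5 * x, math.sin, xs, noise
    )
    # The two printed numbers agree up to floating-point rounding.
    print(loss, approximation - cross + noise_energy)
```

The noise energy term is the same for every candidate $h$, which is why it cancels when the basic inequality is re-arranged.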

Let $\mathcal{H}=\{f+b-f^{*}\mid f\in\mathscr{F},b\in\mathscr{B}\}$, and let $\hat{h}=\widehat{f}+\widehat{b}-f^{*}\in\mathcal{H}$. We seek to bound the expectation of the random variable

$$
Z\coloneqq\frac{2}{n}\sum_{i=1}^{n}\xi_{i}\hat{h}(x_{i}).
$$

Consider a minimal $\delta$-covering of $\mathcal{H}$ with respect to the empirical $L_{\infty}$ norm, denoted by $\{h_{1},\dots,h_{N}\}$, where $N=\mathcal{N}_{n}(\delta;\mathscr{F}+\mathscr{B})$. By construction there exists a (random) index $j^{*}$ such that $\|\hat{h}-h_{j^{*}}\|_{\infty}\leq\delta$. Using this fact, we can decompose $Z$ as

$$
Z=\frac{2}{n}\sum_{i=1}^{n}\xi_{i}\big(\hat{h}(x_{i})-h_{j^{*}}(x_{i})\big)+\frac{2}{n}\sum_{i=1}^{n}\xi_{i}h_{j^{*}}(x_{i}).
$$

For the first term, by Hölder's inequality and the bound on the cover, we have $\big|\frac{2}{n}\sum_{i}\xi_{i}(\hat{h}(x_{i})-h_{j^{*}}(x_{i}))\big|\leq\frac{2}{n}\sum_{i}|\xi_{i}|\,\delta$. Taking expectations, this is bounded by $2\delta\,\mathbb{E}[|\xi|]\leq 2\delta\sigma$ using Jensen's inequality.

As for the second term, we let $\zeta_{j}=\frac{\sum_{i}\xi_{i}h_{j}(x_{i})}{\sqrt{n}\,\|h_{j}\|_{n}}$ be a normalized noise interaction for the fixed function $h_{j}$. Conditioned on $\{x_{i}\}$, the auxiliary random variable $\zeta_{j}$ is a weighted sum of independent sub-Gaussian variables, which is itself sub-Gaussian with variance proxy $\sigma^{2}$. We can bound the second term as

$$
\mathbb{E}\left[\frac{2}{n}\sum_{i=1}^{n}\xi_{i}h_{j^{*}}(x_{i})\right]\leq\mathbb{E}\left|\frac{2}{\sqrt{n}}\|h_{j^{*}}\|_{n}\,\zeta_{j^{*}}\right|.
$$

Applying the Cauchy–Schwarz inequality yields

$$
\mathbb{E}\left[\frac{2}{\sqrt{n}}\|h_{j^{*}}\|_{n}\,|\zeta_{j^{*}}|\right]\leq\frac{2}{\sqrt{n}}\left(\mathbb{E}[\|h_{j^{*}}\|_{n}^{2}]\right)^{1/2}\left(\mathbb{E}\Big[\max_{j\in[N]}\zeta_{j}^{2}\Big]\right)^{1/2}.
$$

By standard results on the maxima of sub-Gaussian variables, we have

$$
\mathbb{E}\Big[\max_{j=1,\ldots,N}\zeta_{j}^{2}\Big]\leq\sigma^{2}(3\log N+1).
$$

For the norm term, the bound $\|h_{j^{*}}\|_{n}\leq\|\hat{h}\|_{n}+\delta$ follows from the definition of $h_{j^{*}}$ and the triangle inequality. Also, the triangle inequality in $L_{2}(\mathbb{E}[\cdot])$ yields $\mathbb{E}[\|h_{j^{*}}\|_{n}^{2}]^{1/2}\leq\mathbb{E}[\|\hat{h}\|_{n}^{2}]^{1/2}+\delta$. Recalling that $\widehat{\mathcal{A}}(\widehat{f},\widehat{b})=\|\hat{h}\|_{n}^{2}+\lambda\|\widehat{b}\|_{m}^{2}\geq\|\hat{h}\|_{n}^{2}$, we have $\mathbb{E}[\|h_{j^{*}}\|_{n}^{2}]^{1/2}\leq\sqrt{\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})}+\delta$.
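
To get a feel for the maximal inequality $\mathbb{E}[\max_{j}\zeta_{j}^{2}]\leq\sigma^{2}(3\log N+1)$, the following Monte Carlo sketch compares both sides in a simplified setting. It assumes independent Gaussian variables with standard deviation $\sigma$, which is only a special case: the $\zeta_{j}$ in the proof are sub-Gaussian and generally correlated. The function names are hypothetical.

```python
import math
import random


def mean_max_square(num_vars, sigma, num_trials, seed=0):
    """Monte Carlo estimate of E[max_j zeta_j^2] for independent N(0, sigma^2) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        total += max(rng.gauss(0.0, sigma) ** 2 for _ in range(num_vars))
    return total / num_trials


def max_square_bound(num_vars, sigma):
    """The bound sigma^2 * (3 log N + 1) used in the proof."""
    return sigma ** 2 * (3.0 * math.log(num_vars) + 1.0)


if __name__ == "__main__":
    for num_vars in (10, 100, 1000):
        estimate = mean_max_square(num_vars, 1.0, num_trials=500)
        print(num_vars, estimate, max_square_bound(num_vars, 1.0))
```

In this Gaussian special case the estimate grows roughly like $2\log N$, so the logarithmic dependence on the covering number in the bound is of the right order.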

Putting together the pieces, we have established the upper bound

$$
\mathbb{E}[Z]\leq\frac{2\sigma\sqrt{3\log N+1}}{\sqrt{n}}\left(\sqrt{\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})}+\delta\right)+2\delta\sigma.
$$

Substituting this upper bound back into equation ([15](https://arxiv.org/html/2605.14280#A2.E15)), we find that

$$
\begin{aligned}
\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})&\leq\mathcal{A}(f,b)+2\sqrt{\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})}\sqrt{\frac{\sigma^{2}(3\log N+1)}{n}}+\delta\left(2\sigma\sqrt{\frac{3\log N+1}{n}}+2\sigma\right)\\
&\leq\mathcal{A}(f,b)+2\sqrt{\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})}\sqrt{\frac{\sigma^{2}(3\log N+1)}{n}}+6\delta\sigma,
\end{aligned}
$$

where the second inequality follows from our assumption that $\log N\leq n$. Note that this inequality is of the form $u\leq v+2\sqrt{us}+t$, where $u=\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})$, $v=\mathcal{A}(f,b)$, $s=\frac{\sigma^{2}(3\log N+1)}{n}$, and $t=6\delta\sigma$. By Young's inequality, it implies that $u\leq\frac{v}{1-\eta}+\frac{s}{\eta\,(1-\eta)}+\frac{t}{1-\eta}$ for any $\eta\in(0,1)$. We can choose $\eta\in(0,1)$ such that $\frac{1}{1-\eta}=1+\epsilon$ for an arbitrary $\epsilon\in(0,1)$. Doing so yields

$$
\mathbb{E}\,\widehat{\mathcal{A}}(\widehat{f},\widehat{b})\leq(1+\epsilon)\mathcal{A}(f,b)+\frac{(1+\epsilon)^{2}}{\epsilon}\frac{\sigma^{2}(3\log N+1)}{n}+(1+\epsilon)\,6\sigma\delta.
$$

Since this holds for any fixed $(f,b)$, we can take the infimum over the function classes, thereby obtaining the claimed result.

## Appendix C Proof of [Section 3](https://arxiv.org/html/2605.14280#S3)

We now turn to the proof of the corollary stated in [Section 3](https://arxiv.org/html/2605.14280#S3).

### C.1 Sparse ReLU approximation tools

This section collects the analytic tools used to derive the corollary in [Section 3](https://arxiv.org/html/2605.14280#S3) from the oracle inequality in [Theorem 2](https://arxiv.org/html/2605.14280#Thmtheorem2). We first recall the Hölder balls and sparse clipped ReLU classes used in the main text. We then state the joint approximation lemma showing that the auxiliary class can approximate the population correction $b_{f}^{*}(x)=-v_{\lambda}(x)(f(x)-f^{*}(x))$ while retaining explicit control of depth, sparsity, and multiplication precision.
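
The form of the population correction $b_{f}^{*}$ can be motivated by a pointwise calculation. The sketch below assumes that the population objective has the pointwise integrand $p(x)\,(f(x)+b(x)-f^{*}(x))^{2}+\lambda\,q(x)\,b(x)^{2}$ and that $v_{\lambda}=p/(p+\lambda q)$, as in definition (17d); the function names and the numerical values are purely illustrative. Setting the derivative in $b$ to zero gives $b=-v_{\lambda}\,(f-f^{*})$, with minimal value $\lambda\,q\,v_{\lambda}\,(f-f^{*})^{2}$.

```python
def v_lambda(p, q, lam):
    """Offset weight p / (p + lam * q) at a point with source density p, target density q."""
    return p / (p + lam * q)


def pointwise_objective(b, delta, p, q, lam):
    """Integrand p * (delta + b)^2 + lam * q * b^2, where delta = f(x) - f*(x)."""
    return p * (delta + b) ** 2 + lam * q * b ** 2


def optimal_offset(delta, p, q, lam):
    """Closed-form minimizer b = -v_lambda * delta of the pointwise objective."""
    return -v_lambda(p, q, lam) * delta


if __name__ == "__main__":
    delta, p, q, lam = 2.0, 0.3, 0.7, 0.5
    b_star = optimal_offset(delta, p, q, lam)
    # The minimal value matches lam * q * v_lambda * delta^2 up to rounding.
    print(b_star, pointwise_objective(b_star, delta, p, q, lam))
    print(lam * q * v_lambda(p, q, lam) * delta ** 2)
```

Since $v_{\lambda}\in[0,1]$, the correction never exceeds the current error $|f-f^{*}|$ in magnitude, which is the sense in which it is localized to the error of $f$.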

Recall that our result imposes Hölder smoothness conditions on the regression function $f^{*}$ and the offset weight $v_{\lambda}$.

###### Definition 6 (Hölder space).

For a domain $\mathcal{X}\subset\mathbb{R}^{d}$, smoothness index $\beta>0$, and radius $K>0$, the Hölder ball $C_{d}^{\beta}(\mathcal{X},K)$ corresponds to the set of all functions $f:\mathcal{X}\to\mathbb{R}$ such that

$$
\sum_{\boldsymbol{\alpha}:|\boldsymbol{\alpha}|<\beta}\|\partial^{\boldsymbol{\alpha}}f\|_{\infty}+\sum_{\boldsymbol{\alpha}:|\boldsymbol{\alpha}|=\lfloor\beta\rfloor}\sup_{\substack{\mathbf{x},\mathbf{y}\in\mathcal{X}\\ \mathbf{x}\neq\mathbf{y}}}\frac{|\partial^{\boldsymbol{\alpha}}f(\mathbf{x})-\partial^{\boldsymbol{\alpha}}f(\mathbf{y})|}{\|\mathbf{x}-\mathbf{y}\|_{\infty}^{\beta-\lfloor\beta\rfloor}}\leq K,
$$

where $\boldsymbol{\alpha}=(\alpha_{1},\dots,\alpha_{d})\in\mathbb{N}_{0}^{d}$ is a multi-index, $|\boldsymbol{\alpha}|=\sum_{i}\alpha_{i}$, and $\partial^{\boldsymbol{\alpha}}=\partial_{1}^{\alpha_{1}}\dots\partial_{d}^{\alpha_{d}}$.

Our result applies to the TILT procedure applied using sparse ReLU neural networks, along with a clipping function that enforces bounded outputs. We use $\sigma(x)=\max(x,0)$ to denote the ReLU activation function. For a shift vector $\mathbf{v}\in\mathbb{R}^{r}$, let $\sigma_{\mathbf{v}}:\mathbb{R}^{r}\to\mathbb{R}^{r}$ be the component-wise map $\sigma_{\mathbf{v}}(y_{1},\dots,y_{r})=(\sigma(y_{1}-v_{1}),\dots,\sigma(y_{r}-v_{r}))^{\top}$.

###### Definition 7 (Sparse deep ReLU networks).

A ReLU network with architecture $(L,\mathbf{p})$ is a function of the form

$$
f(\mathbf{x})=W_{L}\sigma_{\mathbf{v}_{L}}W_{L-1}\sigma_{\mathbf{v}_{L-1}}\dots W_{1}\sigma_{\mathbf{v}_{1}}W_{0}\mathbf{x},
$$

where $W_{i}\in\mathbb{R}^{p_{i+1}\times p_{i}}$ are weight matrices and $\mathbf{v}_{i}\in\mathbb{R}^{p_{i}}$, $i=1,\dots,L$, are shift vectors. We define the class $\mathtt{NN}(L,\mathbf{p},s,B)$ as the set of such functions satisfying:

1. (i) $L$ hidden layers and width vector $\mathbf{p}=(p_{0},\dots,p_{L+1})$.
2. (ii) Sparsity constraint: $\sum_{j=0}^{L}\|W_{j}\|_{0}+\sum_{j=1}^{L}\|\mathbf{v}_{j}\|_{0}\leq s$.
3. (iii) Parameter bound: $\max_{0\leq j\leq L}\|W_{j}\|_{\infty}\lor\max_{1\leq j\leq L}\|\mathbf{v}_{j}\|_{\infty}\leq 1$.
4. (iv) Clipping: $\|f\|_{\infty}\leq B$.
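
The definition above can be made concrete with a small evaluation routine. This is a minimal sketch in plain Python: the helper names, the toy network, and the choice to enforce condition (iv) by clipping the output to $[-B,B]$ are assumptions made for illustration, not the authors' implementation.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def shifted_relu(y, v):
    """Component-wise map sigma_v(y) = (max(y_i - v_i, 0))_i."""
    return [max(yi - vi, 0.0) for yi, vi in zip(y, v)]


def forward(weights, shifts, x, clip):
    """Evaluate W_L sigma_{v_L} ... W_1 sigma_{v_1} W_0 x, then clip to [-clip, clip].

    weights = [W_0, ..., W_L] and shifts = [v_1, ..., v_L].
    """
    h = matvec(weights[0], x)
    for W, v in zip(weights[1:], shifts):
        h = matvec(W, shifted_relu(h, v))
    return [max(-clip, min(clip, out)) for out in h]


def sparsity(weights, shifts):
    """Number of nonzero weights and shifts (the quantity bounded by s)."""
    nonzero_w = sum(1 for W in weights for row in W for w in row if w != 0.0)
    nonzero_v = sum(1 for v in shifts for vi in v if vi != 0.0)
    return nonzero_w + nonzero_v


def max_abs_parameter(weights, shifts):
    """Largest absolute entry over all weights and shifts (required to be <= 1)."""
    entries = [abs(w) for W in weights for row in W for w in row]
    entries += [abs(vi) for v in shifts for vi in v]
    return max(entries)


# Toy network with L = 1 hidden layer and widths p = (1, 2, 1):
# f(x) = relu(x) - relu(x - 0.5), which equals min(x, 0.5) for x >= 0.
toy_weights = [[[1.0], [1.0]], [[1.0, -1.0]]]
toy_shifts = [[0.0, 0.5]]

if __name__ == "__main__":
    print(forward(toy_weights, toy_shifts, [0.75], clip=1.0))
    print(sparsity(toy_weights, toy_shifts), max_abs_parameter(toy_weights, toy_shifts))
```

The toy network uses four nonzero weights and one nonzero shift, so it belongs to the class with $L=1$, $\mathbf{p}=(1,2,1)$, $s=5$, and any $B\geq 0.5$ on inputs in $[0,1]$.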

The next lemma combines the Hölder approximation theorem for sparse ReLU networks with a ReLU multiplication sub-network of size $M$. This is required in the analysis, since ReLU networks generate piecewise affine outputs, and so can approximate a product $uv$ only up to some finite precision. A multiplication sub-network yields an approximation accurate to the order $2^{-M}$. In addition, we use $c_{j}$, $j=0,1,2$, etc. to denote universal constants.

###### Lemma 8 (Joint approximation by sparse ReLU networks).

Consider the sparse ReLU classes $\mathscr{F}=\mathtt{NN}(L_{f},\mathbf{p}_{f},s_{f},B)$ and $\mathscr{B}=\mathtt{NN}(L_{g},\mathbf{p}_{g},s_{g},B)$, and suppose that $f^{*}\in C_{d}^{\beta}([0,1]^{d},K)$ and $v_{\lambda}\in C_{d}^{\gamma}([0,1]^{d},K)$. Then we have:

(i) The regression function $f^{*}$ satisfies the approximation bound

$$
\inf_{f\in\mathscr{F}}\mathcal{E}_{\lambda}^{2}(f)\equiv\inf_{f\in\mathscr{F}}\mathbb{E}_{\mathbb{Q}}\Big[v_{\lambda}(X)\big(f(X)-f^{*}(X)\big)^{2}\Big]\leq c_{0}K^{2}s_{f}^{-2\beta/d}\log^{2}(s_{f}). \tag{16a}
$$

(ii) Fix any $f\in\mathscr{F}$, any sparsity level $s_{v}\geq 1$, and any multiplication precision $M=1,2,\ldots$. For any architecture with

$$
L_{g}\geq c_{1}\big\{L_{f}+\log s_{f}+\log s_{v}+M\big\}\quad\mbox{and}\quad s_{g}\geq c_{2}\big\{s_{f}+s_{v}+M\log M\big\}, \tag{16b}
$$

there exists a function $b\in\mathscr{B}$ such that

$$
\|b-b^{*}_{f}\|_{\mathbb{P}_{\lambda}}^{2}\leq c_{3}\left(s_{f}^{-2\beta/d}\log^{2}s_{f}+s_{v}^{-2\gamma/d}\log^{2}s_{v}+2^{-2M}\right), \tag{16c}
$$

where the constants may depend on the fixed problem parameters $(d,\beta,\gamma,B,K)$, but not on $s_{f},s_{v}$ or $M$.

See [Section C.3](https://arxiv.org/html/2605.14280#A3.SS3) for the proof.

### C.2 Main argument

We now turn to the proof of the corollary in [Section 3](https://arxiv.org/html/2605.14280#S3). Throughout, we use $(c,c^{\prime},c_{j})$ etc. to denote constants independent of the sample sizes and network sizes, while allowing dependence on the fixed smoothness, boundedness, and dimension parameters, including the Hölder radius $K$. When this radius is uniform in $\lambda$, the only displayed $\lambda$-dependence is through $\kappa_{\lambda}$ and $n_{\lambda}$. From the statement of the corollary, recall the shorthand notation $n_{\lambda}\coloneqq\min\{\lambda n,m\}$ and $\kappa_{\lambda}\coloneqq(1+\lambda)/\lambda$. We fix the oracle parameter in [Theorem 2](https://arxiv.org/html/2605.14280#Thmtheorem2), say $s=1/2$. We choose $s_{v}\asymp(\kappa_{\lambda}n_{\lambda})^{d/(2\gamma+d)}$ for the approximation of $v_{\lambda}$, and choose the multiplication precision $M\asymp\log(\kappa_{\lambda}n_{\lambda})$. The network sizes in [Section 3](https://arxiv.org/html/2605.14280#S3) ensure the bounds ([16b](https://arxiv.org/html/2605.14280#A3.E16.2)) hold, so we can apply Lemma 8 ([Section C.1](https://arxiv.org/html/2605.14280#A3.SS1)). For the (fixed) triple $\beta,\gamma,d$, the $M\log M$ term is absorbed by the polynomial sparsity terms, and the depth requirement is of order $\log(\kappa_{\lambda}n_{\lambda})$. For $n_{\lambda}\geq 2$, choose $\delta=(\kappa_{\lambda}n_{\lambda})^{-A}$ with a sufficiently large constant $A$. For the fixed value of $\lambda$ under consideration, the remainder term $\operatorname{Rem}(\delta)$ in [Theorem 2](https://arxiv.org/html/2605.14280#Thmtheorem2) is then of lower order than the rate below, while the additional $\log(1/\delta)$ factor is absorbed into the displayed logarithmic terms.

From the sparse ReLU entropy bound (schmidt2020nonparametric, Lemma A.1) and using the polynomial width condition, we have (up to logarithmic factors) an upper bound of the form

$$
\frac{\log\mathcal{N}_{n}(\delta;\mathscr{F}+\mathscr{B})}{\lambda n}+\frac{\log\mathcal{N}_{m}(\delta;\mathscr{B})}{m}\lesssim\frac{s_{f}+s_{g}}{n_{\lambda}}\,\big(\log(\kappa_{\lambda}n_{\lambda})\big)^{2}.
$$

With the auxiliary choices above, this quantity can be further upper bounded by

$$
\frac{s_{f}+s_{v}}{n_{\lambda}}\,\big(\log(\kappa_{\lambda}n_{\lambda})\big)^{3}.
$$

For the approximation term, [Section 2.1](https://arxiv.org/html/2605.14280#S2.SS1) and Lemma 8 ([Section C.1](https://arxiv.org/html/2605.14280#A3.SS1)) imply that

$$
\frac{1}{\lambda}\inf_{f\in\mathscr{F},\,b\in\mathscr{B}}\mathcal{A}(f,b)\lesssim s_{f}^{-2\beta/d}\log^{2}s_{f}+\kappa_{\lambda}\left(s_{f}^{-2\beta/d}\log^{2}s_{f}+s_{v}^{-2\gamma/d}\log^{2}s_{v}+2^{-2M}\right).
$$

Taking $M\asymp\log(\kappa_{\lambda}n_{\lambda})$ makes the multiplication error negligible relative to the polynomial terms. Thus the bound in [Theorem 2](https://arxiv.org/html/2605.14280#Thmtheorem2) is controlled, up to logarithmic factors, by

$$
(1+\kappa_{\lambda})\,s_{f}^{-2\beta/d}+\kappa_{\lambda}\,s_{v}^{-2\gamma/d}+\frac{s_{f}+s_{v}}{n_{\lambda}}.
$$

Since $\kappa_{\lambda}>1$, we can absorb $1+\kappa_{\lambda}$ into $\kappa_{\lambda}$. Balancing the first and third terms yields $s_{f}\asymp(\kappa_{\lambda}n_{\lambda})^{d/(2\beta+d)}$, while balancing the second and third yields $s_{v}\asymp(\kappa_{\lambda}n_{\lambda})^{d/(2\gamma+d)}$. Substituting these choices gives

$$
\mathbb{E}\,\mathcal{E}_{\lambda}^{2}(\widehat{f})\lesssim\kappa_{\lambda}^{\frac{d}{2\beta+d}}n_{\lambda}^{-\frac{2\beta}{2\beta+d}}\big(\log(\kappa_{\lambda}n_{\lambda})\big)^{3}+\kappa_{\lambda}^{\frac{d}{2\gamma+d}}n_{\lambda}^{-\frac{2\gamma}{2\gamma+d}}\big(\log(\kappa_{\lambda}n_{\lambda})\big)^{3},
$$

as claimed.
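
The balancing step can be illustrated numerically. The sketch below is only a calculator for the displayed quantities, with all constants set to one; the function name `tilt_rate` and the sample values of $(n,m,\lambda,\beta,\gamma,d)$ are hypothetical. It computes $n_{\lambda}$, $\kappa_{\lambda}$, the balanced sparsity levels, and the two polynomial rate terms.

```python
import math


def tilt_rate(n, m, lam, beta, gamma, d):
    """Balanced sparsity levels and rate terms, ignoring constants."""
    n_lam = min(lam * n, m)
    kappa = (1.0 + lam) / lam
    # Balanced sparsity levels from the proof.
    s_f = (kappa * n_lam) ** (d / (2.0 * beta + d))
    s_v = (kappa * n_lam) ** (d / (2.0 * gamma + d))
    # Polynomial parts of the two rate terms, and the common log^3 factor.
    rate_f = kappa ** (d / (2.0 * beta + d)) * n_lam ** (-2.0 * beta / (2.0 * beta + d))
    rate_v = kappa ** (d / (2.0 * gamma + d)) * n_lam ** (-2.0 * gamma / (2.0 * gamma + d))
    log_cubed = math.log(kappa * n_lam) ** 3
    return {
        "n_lam": n_lam, "kappa": kappa, "s_f": s_f, "s_v": s_v,
        "rate_f": rate_f, "rate_v": rate_v, "log_cubed": log_cubed,
    }


if __name__ == "__main__":
    out = tilt_rate(n=10_000, m=5_000, lam=0.5, beta=2.0, gamma=1.0, d=4)
    # At the balanced choice, approximation and estimation terms coincide.
    print(out["kappa"] * out["s_f"] ** (-2.0 * 2.0 / 4), out["s_f"] / out["n_lam"], out["rate_f"])
```

At the balanced $s_{f}$, the approximation term $\kappa_{\lambda}s_{f}^{-2\beta/d}$ and the estimation term $s_{f}/n_{\lambda}$ are equal, and both match the polynomial part of the first rate term.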

### C.3 Proof of Lemma 8 ([Section C.1](https://arxiv.org/html/2605.14280#A3.SS1))

Part (i) follows directly from (schmidt2020nonparametric, Theorem 5), so that it remains to prove part (ii).

Given an arbitrary $f\in\mathscr{F}$, our goal is to approximate the optimal offset $b^{*}_{f}(x)\coloneqq-v_{\lambda}(x)(f(x)-f^{*}(x))$, where $v_{\lambda}\in C_{d}^{\gamma}$ by assumption. Fix a sparsity index $s_{v}\geq 1$ and multiplication parameter $M\geq 1$. Since $v_{\lambda}\in C_{d}^{\gamma}$ and $f^{*}\in C_{d}^{\beta}$, (schmidt2020nonparametric, Theorem 5) guarantees the existence of networks $\widehat{v}_{\lambda}\in\mathtt{NN}(L_{v},\mathbf{p}_{v},s_{v},1)$ and $\hat{f^{*}}\in\mathtt{NN}(L_{f^{*}},\mathbf{p}_{f^{*}},s_{f},B)$ such that

$$
\|\widehat{v}_{\lambda}-v_{\lambda}\|_{\infty}^{2}\lesssim s_{v}^{-2\gamma/d}\log^{2}s_{v},\qquad\|\hat{f^{*}}-f^{*}\|_{\infty}^{2}\lesssim s_{f}^{-2\beta/d}\log^{2}s_{f}.
$$

We construct $b$ using the multiplication network $\mathrm{Mult}_{M}$ from (schmidt2020nonparametric, Lemma A.2). This network approximates the product $uv$ on $[0,1]^{2}$ with error $2^{-M}$; by an affine rescaling, the same construction approximates products on $[0,1]\times[-1,1]$ with the same order of error. After the standard clipping step, we may take $\widehat{v}_{\lambda}\in[0,1]$. Since $|f-\hat{f^{*}}|\leq 2B$, define

$$
b(x)=-2B\,\mathrm{Mult}_{M}\left(\widehat{v}_{\lambda}(x),\frac{f(x)-\hat{f^{*}}(x)}{2B}\right).
$$

By the standard parallelization and composition rules, this network is contained in $\mathscr{B}$ whenever the pair $(L_{g},s_{g})$ satisfies the lower bounds ([16b](https://arxiv.org/html/2605.14280#A3.E16.2)).
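
To illustrate why a ReLU network can only multiply to finite precision, the following sketch implements a Yarotsky-style sawtooth construction. This is a swapped-in, well-known construction used for illustration and is not claimed to be the exact network $\mathrm{Mult}_{M}$ of the cited lemma. The parameter `depth` plays a role analogous to $M$, and all function names are hypothetical. The squaring approximation is the piecewise-linear interpolant of $x^{2}$ on a dyadic grid, the product follows by polarization, and the extension to $[0,1]\times[-1,1]$ uses the affine rescaling $uv=2u\cdot\frac{v+1}{2}-u$.

```python
def relu(z):
    """ReLU activation."""
    return max(z, 0.0)


def tent(x):
    """Tent map on [0, 1] written with three ReLU units."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)


def approx_square(x, depth):
    """Piecewise-linear interpolant of x**2 on the grid k / 2**depth, for x in [0, 1]."""
    out, g = x, x
    for s in range(1, depth + 1):
        g = tent(g)            # s-fold composition of the tent map
        out -= g / 4.0 ** s    # subtract the sawtooth correction at scale s
    return out


def approx_product(u, w, depth):
    """Approximate u * w on [0, 1]^2 via uw = 2((u+w)/2)^2 - u^2/2 - w^2/2."""
    return (2.0 * approx_square(0.5 * (u + w), depth)
            - 0.5 * approx_square(u, depth)
            - 0.5 * approx_square(w, depth))


def approx_signed_product(u, v, depth):
    """Approximate u * v for u in [0, 1], v in [-1, 1] by affine rescaling of v."""
    return 2.0 * approx_product(u, 0.5 * (v + 1.0), depth) - u


if __name__ == "__main__":
    for depth in (2, 4, 8):
        err = max(
            abs(approx_signed_product(i / 20.0, -1.0 + j / 10.0, depth)
                - (i / 20.0) * (-1.0 + j / 10.0))
            for i in range(21) for j in range(21)
        )
        print(depth, err)
```

For this construction the squaring error is at most $2^{-2\,\mathrm{depth}-2}$, so the signed product error is at most $6\cdot 2^{-2\,\mathrm{depth}-2}$: the error decays geometrically in the depth, mirroring the role of the precision term in bound (16c).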

For the error analysis, let $\Delta_{f}(x)=f(x)-f^{*}(x)$ and $\hat{\Delta}_{f}(x)=f(x)-\hat{f^{*}}(x)$. By the triangle inequality, we have

$$
\begin{aligned}
|b(x)-b^{*}_{f}(x)|&\leq\big|b(x)+\widehat{v}_{\lambda}(x)\hat{\Delta}_{f}(x)\big|+\big|\widehat{v}_{\lambda}(x)\hat{\Delta}_{f}(x)-v_{\lambda}(x)\Delta_{f}(x)\big|\\
&\lesssim 2B\,2^{-M}+\|\widehat{v}_{\lambda}-v_{\lambda}\|_{\infty}\,|f(x)-f^{*}(x)|+\|\hat{f^{*}}-f^{*}\|_{\infty}.
\end{aligned}
$$

The functions in $\mathscr{F}$ are bounded by $B$, and $\|f^{*}\|_{\infty}\leq K$ by the Hölder condition on $f^{*}$. Integrating with respect to $\mathbb{P}_{\lambda}$ and using the upper bound $(a+b+c)^{2}\leq 3(a^{2}+b^{2}+c^{2})$ yields

$$
\|b-b^{*}_{f}\|_{\mathbb{P}_{\lambda}}^{2}\lesssim 2^{-2M}+s_{v}^{-2\gamma/d}\log^{2}s_{v}+s_{f}^{-2\beta/d}\log^{2}s_{f}.
$$

This proves the claimed bound.

### C.4 Auxiliary lemmas

Here we restate an auxiliary lemma from schmidt2020nonparametric (cf. equation (42)).

###### Lemma 9 (Concentration of $L_{2}$ risk).

Let $S=(x_{1},\dots,x_{n})$ be an iid sample from $P$, and let $\mathbb{E}_{S}$ denote expectation over $S$. Let $\mathcal{H}$ be a class of functions bounded by $B_{\mathcal{H}}$, and let $h=h_{S}$ be a measurable sample-dependent element of $\mathcal{H}$. Define $\|h_{S}\|_{P}^{2}=\mathbb{E}_{X\sim P}[h_{S}(X)^{2}]$ and $\|h_{S}\|_{n}^{2}=\frac{1}{n}\sum_{i=1}^{n}h_{S}(x_{i})^{2}$. Then

$$
\begin{aligned}
\left|\mathbb{E}_{S}[\|h_{S}\|_{P}^{2}]-\mathbb{E}_{S}[\|h_{S}\|_{n}^{2}]\right|&\leq\frac{B_{\mathcal{H}}}{n}\sqrt{\mathbb{E}_{S}[\|h_{S}\|_{P}^{2}]}\sqrt{36n\log\mathcal{N}_{n}(\delta;\mathcal{H})+2^{8}n}\\
&\quad+B_{\mathcal{H}}^{2}\frac{6\log\mathcal{N}_{n}(\delta;\mathcal{H})+11}{n}+26\delta B_{\mathcal{H}},
\end{aligned}
$$

where $\mathcal{N}_{n}(\delta;\mathcal{H})$ is an almost-sure deterministic upper bound on the $\delta$-covering number of $\mathcal{H}$ with respect to the empirical $L_{\infty}$-norm induced by $S$.
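
As a quick check on how this lemma produces the constants in bound (14a), the following worked substitution specializes its right-hand side to $B_{\mathcal{H}}=3B$, writing $\mathcal{N}_{n}$ for $\mathcal{N}_{n}(\delta;\mathcal{H})$. It is a sketch that only restates algebra implicit in the text:

$$
\begin{aligned}
\frac{3B}{n}\sqrt{36n\log\mathcal{N}_{n}+2^{8}n}&=\frac{3B}{\sqrt{n}}\sqrt{36\log\mathcal{N}_{n}+256},\\
(3B)^{2}\,\frac{6\log\mathcal{N}_{n}+11}{n}&=9B^{2}\,\frac{6\log\mathcal{N}_{n}+11}{n},\\
26\,\delta\,(3B)&=78\,\delta B.
\end{aligned}
$$

These are the three terms of bound (14a). The choice $B_{\mathcal{H}}=B$ with $m$ samples in place of $n$ gives bound (14b) in the same way.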

## Appendix D Generalization to Bregman divergences

The construction in [Section 2.1](https://arxiv.org/html/2605.14280#S2.SS1) is the least-squares instantiation of a more general Bregman-based method. In this section, we describe this more general theory, and its instantiations for $K$-ary classification problems, as described in [Section 2.2](https://arxiv.org/html/2605.14280#S2.SS2).

### D.1 Bregman formulation and decomposition

Let $\mathbb{R}_{*}\coloneqq\mathbb{R}\cup\{\infty\}$ denote the extended real line. Any Bregman divergence is defined by a function $\psi:\Omega\to\mathbb{R}_{*}$ that is strictly convex, continuously differentiable, and of Legendre type (rockafellar1970convex). The function $\psi$ has a Fenchel conjugate $\psi^{*}:\Omega^{*}\to\mathbb{R}_{*}$, and the gradient mapping $\nabla\psi$ maps between the interior of $\Omega$ and $\Omega^{*}$ in a one-to-one way.

The simplest example is the quadratic function $\psi(z)=\frac{1}{2}\|z\|_{2}^{2}$ with domain $\Omega=\mathbb{R}^{K}$. It is self-dual with Fenchel conjugate $\psi^{*}=\psi$ and domain $\Omega^{*}=\mathbb{R}^{K}$. For $K$-ary classification problems, the log-sum-exp function $\psi(z)=\log\big(\sum_{k=1}^{K}e^{z_{k}}\big)$ arises in the standard logistic loss function. It has domain $\Omega=\mathbb{R}^{K}$ and Fenchel conjugate $\psi^{*}$ given by the negative entropy function defined on the simplex $\Omega^{*}=\{u\in\mathbb{R}^{K}\mid u\geq 0,\sum_{j=1}^{K}u_{j}=1\}$.
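
The two examples can be checked numerically. The sketch below is illustrative only, with hypothetical helper names: it evaluates a generic Bregman divergence $D_{\psi}(a,b)=\psi(a)-\psi(b)-\langle\nabla\psi(b),a-b\rangle$, and verifies three standard facts: the quadratic case gives half the squared distance, the gradient of log-sum-exp is the softmax map, and the log-sum-exp divergence equals a Kullback-Leibler divergence between softmax outputs.

```python
import math


def bregman(psi, grad_psi, a, b):
    """D_psi(a, b) = psi(a) - psi(b) - <grad psi(b), a - b>."""
    g = grad_psi(b)
    return psi(a) - psi(b) - sum(gi * (ai - bi) for gi, ai, bi in zip(g, a, b))


def quad(z):
    """Quadratic function 0.5 * ||z||_2^2; its gradient is the identity."""
    return 0.5 * sum(zi * zi for zi in z)


def grad_quad(z):
    return list(z)


def log_sum_exp(z):
    """Numerically stable log(sum_k exp(z_k))."""
    m = max(z)
    return m + math.log(sum(math.exp(zi - m) for zi in z))


def softmax(z):
    """Gradient of log_sum_exp: a point in the probability simplex."""
    m = max(z)
    w = [math.exp(zi - m) for zi in z]
    total = sum(w)
    return [wi / total for wi in w]


def neg_entropy(u):
    """Negative entropy sum_k u_k log u_k on the simplex (conjugate of log_sum_exp)."""
    return sum(ui * math.log(ui) for ui in u if ui > 0.0)


def kl(u, w):
    """Kullback-Leibler divergence between two strictly positive probability vectors."""
    return sum(ui * math.log(ui / wi) for ui, wi in zip(u, w))


if __name__ == "__main__":
    a, b = [0.2, -1.0, 0.7], [1.5, 0.3, -0.4]
    print(bregman(quad, grad_quad, a, b), 0.5 * sum((x - y) ** 2 for x, y in zip(a, b)))
    print(bregman(log_sum_exp, softmax, a, b), kl(softmax(b), softmax(a)))
```

The Fenchel-Young equality $\psi(z)+\psi^{*}(\nabla\psi(z))=\langle z,\nabla\psi(z)\rangle$ also holds here with $\psi^{*}$ the negative entropy, which is one way to confirm the conjugate pair.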

Given a prediction function $f:\mathcal{X}\to\Omega$, we define the mean function associated with $f$ as

$$
\mu_{f}(x)\coloneqq\nabla\psi(f(x))\in\Omega^{*}, \tag{17a}
$$

along with the optimal target in mean space

$$
\mu^{\star}(x)\coloneqq\mathbb{E}[Y\mid X=x]\in\Omega^{*}. \tag{17b}
$$

The canonical pointwise loss that compares this pair is given by

$$
\bar{\ell}_{\psi}(f;x)\coloneqq D_{\psi^{*}}\bigl(\mu^{\star}(x),\mu_{f}(x)\bigr). \tag{17c}
$$

Let $p$ and $q$ denote the source and target densities, and set

$$
v_{\lambda}(x)\coloneqq\frac{p(x)}{p(x)+\lambda q(x)},\qquad d\rho_{\lambda}(x)\coloneqq\bigl(p(x)+\lambda q(x)\bigr)dx. \tag{17d}
$$

Finally, for a convex function $\phi$ and a scalar $\eta\in[0,1]$, we define the weighted Jensen divergence

$$
J_{\phi}^{\eta}(u,v)\coloneqq\eta\phi(u)+(1-\eta)\phi(v)-\phi\bigl(\eta u+(1-\eta)v\bigr). \tag{17e}
$$
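
The following sketch illustrates definitions (17d) and (17e) on a toy example. The Gaussian source and target densities, the evaluation point, and the function names are assumptions chosen for illustration. It shows that $v_{\lambda}$ stays in $[0,1]$ even where the density ratio $q/p$ is very large, and that for the scalar quadratic $\phi(z)=\frac{1}{2}z^{2}$ the weighted Jensen divergence reduces to $\frac{1}{2}\eta(1-\eta)(u-v)^{2}$.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def v_lambda(p, q, lam):
    """Offset weight p / (p + lam * q) from definition (17d)."""
    return p / (p + lam * q)


def jensen_divergence(phi, eta, u, v):
    """Weighted Jensen divergence eta*phi(u) + (1-eta)*phi(v) - phi(eta*u + (1-eta)*v)."""
    return eta * phi(u) + (1.0 - eta) * phi(v) - phi(eta * u + (1.0 - eta) * v)


def half_square(z):
    """Scalar quadratic phi(z) = z^2 / 2."""
    return 0.5 * z * z


if __name__ == "__main__":
    # Source N(0, 1) and target N(4, 1), evaluated far in the target tail of the source.
    x = 6.0
    p, q = gaussian_pdf(x, 0.0, 1.0), gaussian_pdf(x, 4.0, 1.0)
    print(q / p, v_lambda(p, q, lam=1.0))
    print(jensen_divergence(half_square, 0.3, 2.0, -1.0), 0.5 * 0.3 * 0.7 * 9.0)
```

The contrast between the unbounded ratio $q/p$ and the bounded weight $v_{\lambda}$ is the numerical counterpart of the boundedness property emphasized in the abstract.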
Now let us describe the Bregman analogue ofTILT, which is based on inserting an auxiliary correction in the dual mean parameter\. Consider a pair of functions\(f,b\)\(f,b\)that satisfy the pointwise inclusionμf​\(x\)\+b​\(x\)∈Ω∗\\mu\_\{f\}\(x\)\+b\(x\)\\in\\Omega^\{\*\}pointwise, we define:

$$\text{Bregman-TILT:} \qquad \mathcal{A}_{\psi}(f, b) \coloneqq \mathbb{E}_{X \sim \mathbb{P}}\Bigl[D_{\psi^{*}}\bigl(\mu_{f}(X) + b(X), \mu^{\star}(X)\bigr)\Bigr] + \lambda\,\mathbb{E}_{\widetilde{X} \sim \mathbb{Q}}\Bigl[D_{\psi^{*}}\bigl(\mu_{f}(\widetilde{X}) + b(\widetilde{X}), \mu_{f}(\widetilde{X})\bigr)\Bigr].$$

Observe that the first term fits a corrected source prediction to the oracle conditional mean, whereas the second term penalizes the correction on target inputs. It is a natural generalization of the TILT least-squares objective: in particular, if we use the Bregman function $\psi(z) = \frac{1}{2}z^{2}$, then $D_{\psi^{*}}(u, v) = \frac{1}{2}(u - v)^{2}$, and we have the equivalences

$$D_{\psi^{*}}\bigl(\mu_{f}(X) + b(X), \mu^{\star}(X)\bigr) = \tfrac{1}{2}\big(\mu_{f}(X) + b(X) - \mu^{\star}(X)\big)^{2} \quad \mbox{and} \quad D_{\psi^{*}}\bigl(\mu_{f}(\widetilde{X}) + b(\widetilde{X}), \mu_{f}(\widetilde{X})\bigr) = \tfrac{1}{2} b^{2}(\widetilde{X}),$$

so that we recover the original TILT least-squares formulation up to the factor $1/2$.

To state the decomposition, introduce the natural parameters

$$\theta^{*}(x) \coloneqq \nabla\psi^{*}\bigl(\mu^{\star}(x)\bigr), \qquad \theta_{f}(x) \coloneqq \nabla\psi^{*}\bigl(\mu_{f}(x)\bigr) = f(x), \tag{18a}$$

along with their $\lambda$-dependent barycenter

$$\bar{\theta}_{f,\lambda}(x) \coloneqq v_{\lambda}(x)\theta^{*}(x) + \bigl(1 - v_{\lambda}(x)\bigr)\theta_{f}(x). \tag{18b}$$

###### Proposition 10 (Exact Bregman decomposition).

For any admissible pair $(f, b)$, we have the decomposition

$$\mathcal{A}_{\psi}(f, b) = \int J_{\psi}^{v_{\lambda}(x)}\bigl(\theta^{*}(x), \theta_{f}(x)\bigr)\,d\rho_{\lambda}(x) + \int D_{\psi}\Bigl(\bar{\theta}_{f,\lambda}(x), \nabla\psi^{*}\bigl(\mu_{f}(x) + b(x)\bigr)\Bigr)\,d\rho_{\lambda}(x). \tag{19}$$

###### Proof.

From classical results, Bregman divergences satisfy the primal-dual identity

$$D_{\psi^{*}}(u, v) = D_{\psi}\bigl(\nabla\psi^{*}(v), \nabla\psi^{*}(u)\bigr).$$

Using this fact, we can rewrite the integrand that defines $\mathcal{A}_{\psi}(f, b)$ as

$$p(x)\,D_{\psi}\Bigl(\theta^{*}(x), \nabla\psi^{*}\bigl(\mu_{f}(x) + b(x)\bigr)\Bigr) + \lambda q(x)\,D_{\psi}\Bigl(\theta_{f}(x), \nabla\psi^{*}\bigl(\mu_{f}(x) + b(x)\bigr)\Bigr).$$

Since $p(x) = v_{\lambda}(x)\bigl(p(x) + \lambda q(x)\bigr)$ and $\lambda q(x) = \bigl(1 - v_{\lambda}(x)\bigr)\bigl(p(x) + \lambda q(x)\bigr)$, this integrand is the density of $\rho_{\lambda}$ times a $v_{\lambda}(x)$-weighted combination of two Bregman divergences with a common second argument. For any convex $\phi$ and any $\eta \in [0,1]$, the weighted Jensen divergence $J_{\phi}^{\eta}$ satisfies the identity

$$\eta D_{\phi}(u, z) + (1-\eta) D_{\phi}(v, z) = J_{\phi}^{\eta}(u, v) + D_{\phi}\bigl(\eta u + (1-\eta)v, z\bigr).$$

We apply this fact with the choices $\phi = \psi$, $\eta = v_{\lambda}(x)$, $u = \theta^{*}(x)$, $v = \theta_{f}(x)$, and $z = \nabla\psi^{*}(\mu_{f}(x) + b(x))$, thereby obtaining the claim ([19](https://arxiv.org/html/2605.14280#A4.E19)). ∎
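The three-point identity used in the proof can be checked numerically. The following is a minimal sketch, assuming the log-partition generator $\psi(z) = \log\sum_k e^{z_k}$ as the concrete choice of $\phi$; the helper names (`lse`, `bregman`, `jensen`, `identity_gap`) and the test vectors are illustrative and do not come from the paper.

```python
import math

def lse(z):
    """Log-partition generator psi(z) = log sum_k exp(z_k), computed stably."""
    m = max(z)
    return m + math.log(sum(math.exp(t - m) for t in z))

def softmax(z):
    """Gradient of the log-partition generator."""
    m = max(z)
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]

def bregman(u, z):
    """D_psi(u, z) = psi(u) - psi(z) - <grad psi(z), u - z>."""
    g = softmax(z)
    return lse(u) - lse(z) - sum(gk * (uk - zk) for gk, uk, zk in zip(g, u, z))

def jensen(eta, u, v):
    """Weighted Jensen divergence J_psi^eta(u, v) from (17e)."""
    mix = [eta * a + (1.0 - eta) * c for a, c in zip(u, v)]
    return eta * lse(u) + (1.0 - eta) * lse(v) - lse(mix)

def identity_gap(eta, u, v, z):
    """Absolute difference between the two sides of the three-point identity."""
    mix = [eta * a + (1.0 - eta) * c for a, c in zip(u, v)]
    lhs = eta * bregman(u, z) + (1.0 - eta) * bregman(v, z)
    rhs = jensen(eta, u, v) + bregman(mix, z)
    return abs(lhs - rhs)

# Arbitrary illustrative inputs (not taken from the paper).
u = [0.3, -1.2, 2.0]
v = [1.5, 0.4, -0.7]
z = [-0.5, 0.9, 0.1]
print(identity_gap(0.35, u, v, z))  # should be at floating-point round-off level
```

The identity is algebraic, so the gap should sit at round-off level for any choice of `eta` in $[0,1]$ and any vectors of matching length.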

##### Profiled target limit:

We now describe the natural generalization of [Section 2.1](https://arxiv.org/html/2605.14280#S2.SS1). By inspection of equation ([19](https://arxiv.org/html/2605.14280#A4.E19)), if we minimize the function $\mathcal{A}_{\psi}(f, b)$ over the choice of $b$, then the second term vanishes and we obtain

$$\mathcal{J}_{\psi,\lambda}(f) \coloneqq \inf_{b} \mathcal{A}_{\psi}(f, b) = \int J_{\psi}^{v_{\lambda}(x)}\bigl(\theta^{*}(x), \theta_{f}(x)\bigr)\,d\rho_{\lambda}(x).$$

Note that the function $b_{f,\lambda}$ that achieves the infimum is defined pointwise by the relation $\nabla\psi^{*}\bigl(\mu_{f}(x) + b_{f,\lambda}(x)\bigr) = \bar{\theta}_{f,\lambda}(x)$, or equivalently, via the Legendre properties,

$$b_{f,\lambda}(x) = \nabla\psi\bigl(\bar{\theta}_{f,\lambda}(x)\bigr) - \mu_{f}(x), \tag{20}$$

where we have used the fact that $\nabla\psi$ is the inverse of $\nabla\psi^{*}$.
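To make the profiled correction (20) concrete, here is a minimal pointwise sketch, assuming the log-partition generator, for which $\nabla\psi$ is the softmax map and $D_{\psi^{*}}$ is the KL divergence. The function names and the numerical values of $p(x)$, $q(x)$, $\lambda$ and the logits are hypothetical. The sketch checks that the pointwise objective evaluated at $b_{f,\lambda}$ equals $(p + \lambda q)\,J_{\psi}^{v_{\lambda}}(\theta^{*}, \theta_{f})$, and that a small perturbation of $b$ inside the simplex increases it.

```python
import math

def lse(z):
    m = max(z)
    return m + math.log(sum(math.exp(t - m) for t in z))

def softmax(z):
    m = max(z)
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]

def kl(u, v):
    """KL(u || v) for probability vectors with v > 0."""
    return sum(a * math.log(a / c) for a, c in zip(u, v) if a > 0.0)

def jensen(eta, u, v):
    mix = [eta * a + (1.0 - eta) * c for a, c in zip(u, v)]
    return eta * lse(u) + (1.0 - eta) * lse(v) - lse(mix)

def profiled_correction(theta_star, f_x, p_x, q_x, lam):
    """Return (b, v_lambda) with b = grad psi(theta_bar) - mu_f, as in (20)."""
    v = p_x / (p_x + lam * q_x)
    theta_bar = [v * a + (1.0 - v) * c for a, c in zip(theta_star, f_x)]
    mu_f = softmax(f_x)
    b = [s - m for s, m in zip(softmax(theta_bar), mu_f)]
    return b, v

def pointwise_objective(theta_star, f_x, b, p_x, q_x, lam):
    """Integrand of the Bregman-TILT objective at a single x, with D_{psi*} = KL."""
    mu_star, mu_f = softmax(theta_star), softmax(f_x)
    corrected = [m + t for m, t in zip(mu_f, b)]
    return p_x * kl(corrected, mu_star) + lam * q_x * kl(corrected, mu_f)

# Hypothetical values at one covariate x.
theta_star, f_x = [1.0, 0.0, -1.0], [-0.5, 0.5, 0.0]
p_x, q_x, lam = 0.8, 1.5, 0.5
b, v = profiled_correction(theta_star, f_x, p_x, q_x, lam)
opt = pointwise_objective(theta_star, f_x, b, p_x, q_x, lam)
print(opt, (p_x + lam * q_x) * jensen(v, theta_star, f_x))
```

Note that the entries of `b` sum to zero here, so the corrected vector $\mu_{f}(x) + b(x)$ stays on the probability simplex.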

In the least-squares case, taking the limit of the profiled risk as $\lambda \rightarrow 0^{+}$ led to the $L^{2}(\mathbb{Q})$-target risk. We now state and prove the generalization of this property to the Bregman setting:

###### Proposition 11 (Small-$\lambda$ limit).

Assume that $\mu^{\star}(x)$ lies in the interior of $\Omega^{*}$ for almost every $x$, and that the displayed quantities are integrable. Then for any predictor $f$, we have

$$\lim_{\lambda \rightarrow 0^{+}} \frac{1}{\lambda}\mathcal{J}_{\psi,\lambda}(f) = \mathbb{E}_{X \sim \mathbb{Q}_{X}}\Bigl[D_{\psi^{*}}\bigl(\mu^{\star}(X), \mu_{f}(X)\bigr)\Bigr].$$

###### Proof.

For fixed $u, v$ and any $\varepsilon \in [0,1]$, by a first-order Taylor series expansion, we can write

$$J_{\psi}^{1-\varepsilon}(u, v) = \varepsilon D_{\psi}(v, u) + o(\varepsilon),$$

valid as $\varepsilon \rightarrow 0^{+}$. We apply this relation pointwise in $x$ with the choices

$$\varepsilon_{\lambda}(x) \coloneqq 1 - v_{\lambda}(x) = \frac{\lambda q(x)}{p(x) + \lambda q(x)},$$

and $u \coloneqq \theta^{*}(x)$ and $v \coloneqq \theta_{f}(x)$. Doing so yields

$$\bigl(p(x) + \lambda q(x)\bigr) J_{\psi}^{v_{\lambda}(x)}\bigl(\theta^{*}(x), \theta_{f}(x)\bigr) = \lambda q(x)\,D_{\psi}\bigl(\theta_{f}(x), \theta^{*}(x)\bigr) + o(\lambda),$$

where we have used the equivalence $\bigl(p(x) + \lambda q(x)\bigr)\varepsilon_{\lambda}(x) = \lambda q(x)$. Finally, we use the duality relation $D_{\psi}(\theta_{f}, \theta^{*}) = D_{\psi^{*}}(\mu^{\star}, \mu_{f})$. Dividing by $\lambda$ and integrating over $x$ completes the proof. ∎
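The pointwise expansion behind the small-$\lambda$ limit can be illustrated numerically. The sketch below assumes the log-partition generator, so that the limiting pointwise quantity is $q(x)\,\mathrm{KL}(\mu^{\star}(x)\,\|\,\mu_{f}(x))$; the logits and density values are hypothetical, and the helper names are illustrative only.

```python
import math

def lse(z):
    m = max(z)
    return m + math.log(sum(math.exp(t - m) for t in z))

def softmax(z):
    m = max(z)
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]

def kl(u, v):
    return sum(a * math.log(a / c) for a, c in zip(u, v) if a > 0.0)

def jensen(eta, u, v):
    mix = [eta * a + (1.0 - eta) * c for a, c in zip(u, v)]
    return eta * lse(u) + (1.0 - eta) * lse(v) - lse(mix)

def scaled_profiled_integrand(theta_star, f_x, p_x, q_x, lam):
    """(p + lam*q) * J^{v_lam}(theta_star, theta_f) / lam at a single x."""
    v = p_x / (p_x + lam * q_x)
    return (p_x + lam * q_x) * jensen(v, theta_star, f_x) / lam

def target_limit(theta_star, f_x, q_x):
    """Pointwise limit q * D_{psi*}(mu_star, mu_f) = q * KL(mu_star || mu_f)."""
    return q_x * kl(softmax(theta_star), softmax(f_x))

# Hypothetical values at one covariate x.
theta_star, f_x, p_x, q_x = [1.0, 0.0, -1.0], [-0.5, 0.5, 0.0], 0.8, 1.5
limit = target_limit(theta_star, f_x, q_x)
for lam in (1e-1, 1e-2, 1e-3, 1e-4):
    err = abs(scaled_profiled_integrand(theta_star, f_x, p_x, q_x, lam) - limit)
    print(lam, err)  # the error should shrink as lam decreases
```

This only checks the pointwise statement; passing the limit through the integral is where the integrability assumption of the proposition enters.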

### D.2 Special cases

Let us describe how this general formulation recovers both the least-squares TILT described in the main body, as well as a version of $K$-ary classification.

#### D.2.1 Least-squares TILT

Consider the Legendre function $\psi(z) = \frac{1}{2}\|z\|_{2}^{2}$. It satisfies the dual relation $\psi = \psi^{*}$ and $\nabla\psi(z) = z$, so that $\mu_{f}(x) = f(x)$ and $\mu^{\star}(x) = f^{\star}(x)$. Consequently, the Bregman-TILT objective for this choice takes the form

$$\mathcal{A}_{\psi}(f, b) = \mathbb{E}_{X \sim \mathbb{P}_{X}}\Bigl[\frac{1}{2}\|f(X) - f^{\star}(X) + b(X)\|_{2}^{2}\Bigr] + \lambda\,\mathbb{E}_{\widetilde{X} \sim \mathbb{Q}_{X}}\Bigl[\frac{1}{2}\|b(\widetilde{X})\|_{2}^{2}\Bigr].$$

Up to the irrelevant factor $1/2$, this is exactly the original TILT population objective, as described in [Section 2.1](https://arxiv.org/html/2605.14280#S2.SS1).
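In the least-squares case the profiled correction is available in closed form: since $\nabla\psi$ is the identity, equation (20) reduces to $b_{f,\lambda}(x) = v_{\lambda}(x)\bigl(f^{\star}(x) - f(x)\bigr)$. The following minimal sketch checks this for scalar outputs at a single point $x$; the function names and numerical values are hypothetical.

```python
def profiled_b(f_x, f_star_x, p_x, q_x, lam):
    """Closed-form minimizer b = v_lambda * (f_star - f) from (20) with psi(z) = z^2 / 2."""
    v = p_x / (p_x + lam * q_x)
    return v * (f_star_x - f_x)

def pointwise_objective(b, f_x, f_star_x, p_x, q_x, lam):
    """Integrand of the least-squares objective at a single x (scalar outputs)."""
    return 0.5 * p_x * (f_x - f_star_x + b) ** 2 + 0.5 * lam * q_x * b ** 2

def profiled_value(f_x, f_star_x, p_x, q_x, lam):
    """Minimum over b: (1/2) * lam*p*q / (p + lam*q) * (f_star - f)^2."""
    return 0.5 * (lam * p_x * q_x / (p_x + lam * q_x)) * (f_star_x - f_x) ** 2

# Hypothetical values at one covariate x.
f_x, f_star_x, p_x, q_x, lam = 0.2, 1.0, 0.8, 1.5, 0.5
b_opt = profiled_b(f_x, f_star_x, p_x, q_x, lam)
print(pointwise_objective(b_opt, f_x, f_star_x, p_x, q_x, lam))
print(profiled_value(f_x, f_star_x, p_x, q_x, lam))
```

The weight $\lambda p q / (p + \lambda q)$ stays bounded by $\lambda q(x)$ and by $p(x)$, and dividing the profiled value by $\lambda$ and letting $\lambda \to 0^{+}$ leaves $\frac{1}{2} q(x)\bigl(f^{\star}(x) - f(x)\bigr)^{2}$, the pointwise target risk.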

#### D.2.2 KL surrogates for logistic regression

For $K$-class logistic regression, the natural generator is

$$\psi(z) = \log\Bigl(\sum_{k=1}^{K} e^{z_{k}}\Bigr), \qquad \mu_{f}(x) = \nabla\psi(f(x)) = \pi(f(x)),$$

where $\pi$ denotes the softmax map. Its conjugate is negative entropy on the probability simplex,

$$\psi^{*}(u) = \sum_{k=1}^{K} u_{k}\log u_{k},$$

and hence $D_{\psi^{*}}(u, v) = \mathrm{KL}(u\,\|\,v)$ corresponds to the Kullback–Leibler divergence. Writing $\rho(x) \coloneqq \mathbb{P}(Y = \cdot \mid X = x)$, the population Bregman-TILT objective formally becomes

$$\mathcal{A}_{\psi}(f, b) = \mathbb{E}_{X \sim \mathbb{P}_{X}}\Bigl[\mathrm{KL}\bigl(\mu_{f}(X) + b(X)\,\|\,\rho(X)\bigr)\Bigr] + \lambda\,\mathbb{E}_{\widetilde{X} \sim \mathbb{Q}_{X}}\Bigl[\mathrm{KL}\bigl(\mu_{f}(\widetilde{X}) + b(\widetilde{X})\,\|\,\mu_{f}(\widetilde{X})\bigr)\Bigr]. \tag{21}$$

The corresponding profiled objective can be written explicitly. Let $\theta^{\rho}(x) \coloneqq \nabla\psi^{*}(\rho(x))$ be any canonical logit representative of the oracle class-probability vector. By [Section D.1](https://arxiv.org/html/2605.14280#A4.SS1),

$$\mathcal{J}_{\psi,\lambda}^{\mathrm{KL}}(f) \coloneqq \inf_{b}\mathcal{A}_{\psi}(f, b) = \int J_{\psi}^{v_{\lambda}(x)}\bigl(\theta^{\rho}(x), f(x)\bigr)\,d\rho_{\lambda}(x).$$

For the log-partition generator, this weighted Jensen divergence is

$$J_{\psi}^{\eta}(\theta, \theta^{\prime}) = \eta\log\sum_{k=1}^{K} e^{\theta_{k}} + (1-\eta)\log\sum_{k=1}^{K} e^{\theta^{\prime}_{k}} - \log\sum_{k=1}^{K} e^{\eta\theta_{k} + (1-\eta)\theta^{\prime}_{k}}.$$

Choosing the equivalent normalized representatives $\theta^{\rho}(x) = \log\rho(x)$ and $\theta_{f}(x) = \log\mu_{f}(x)$, for which both log-partition terms vanish, gives the simpler probability-coordinate form

$$\mathcal{J}_{\psi,\lambda}^{\mathrm{KL}}(f) = \int -\log\Biggl\{\sum_{k=1}^{K}\rho_{k}(x)^{v_{\lambda}(x)}\,\mu_{f,k}(x)^{1 - v_{\lambda}(x)}\Biggr\}\,d\rho_{\lambda}(x).$$

Thus the KL profiled objective is a covariate-weighted Chernoff-type discrepancy between the oracle class probabilities and the predictor. In the small-$\lambda$ limit, [Section D.1](https://arxiv.org/html/2605.14280#A4.SS1.SSS0.Px1) gives

$$\frac{1}{\lambda}\mathcal{J}_{\psi,\lambda}^{\mathrm{KL}}(f) \longrightarrow \mathbb{E}_{\widetilde{X} \sim \mathbb{Q}_{X}}\Bigl[\mathrm{KL}\bigl(\rho(\widetilde{X})\,\|\,\mu_{f}(\widetilde{X})\bigr)\Bigr].$$

This expression gives the correct geometry, but it is not yet a usable training objective. First, hard labels cannot be substituted directly for $\rho(x)$: the term $\mathrm{KL}(\mu_{f}(x) + b(x)\,\|\,y)$ is infinite unless the corrected prediction puts all mass on the observed class. Second, a mean-space correction must obey the simplex constraint $\mu_{f}(x) + b(x) \in \Delta^{K-1}$, which is awkward for neural-network parametrizations.
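The probability-coordinate (Chernoff-type) form of the profiled integrand can be checked against the log-partition Jensen divergence directly. The sketch below is illustrative only: the logits are hypothetical, and `chernoff_form` and `jensen` are names introduced here. It also illustrates that the Jensen divergence does not depend on which logit representative is used, since adding a constant to all coordinates of either argument leaves it unchanged.

```python
import math

def lse(z):
    m = max(z)
    return m + math.log(sum(math.exp(t - m) for t in z))

def softmax(z):
    m = max(z)
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]

def jensen(eta, u, v):
    """Weighted Jensen divergence for the log-partition generator."""
    mix = [eta * a + (1.0 - eta) * c for a, c in zip(u, v)]
    return eta * lse(u) + (1.0 - eta) * lse(v) - lse(mix)

def chernoff_form(eta, rho, mu):
    """-log sum_k rho_k^eta * mu_k^(1 - eta) for strictly positive probability vectors."""
    return -math.log(sum(r ** eta * m ** (1.0 - eta) for r, m in zip(rho, mu)))

# Hypothetical oracle logits and predictor logits at one covariate x.
theta_rho, f_x, eta = [1.0, 0.0, -1.0], [-0.5, 0.5, 0.0], 0.6
rho, mu_f = softmax(theta_rho), softmax(f_x)

direct = jensen(eta, theta_rho, f_x)                      # unnormalized logits
shifted = jensen(eta, [t + 3.0 for t in theta_rho], f_x)  # shifted representative
prob = chernoff_form(eta, rho, mu_f)                      # probability coordinates
print(direct, shifted, prob)  # all three should agree up to round-off
```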

##### The KL-TILT surrogate.

We address these issues by using a soft teacher and by applying the auxiliary correction in logit space. Let $\pi_{T}(z) \coloneqq \pi(z/T)$ be the temperature-$T$ softmax, and let $\tau: \mathcal{X} \to \mathbb{R}^{K}$ be teacher logits. The direct surrogate for ([21](https://arxiv.org/html/2605.14280#A4.E21)) is

$$\begin{aligned}
\widehat{\mathcal{L}}_{\texttt{KL-TILT}}(f, b; \tau) \coloneqq\; & \frac{T^{2}}{n}\sum_{i=1}^{n}\mathrm{KL}\Bigl(\pi_{T}\bigl(f(x_{i}) + b(x_{i})\bigr)\,\Big\|\,\pi_{T}\bigl(\tau(x_{i})\bigr)\Bigr) \\
& + \frac{\lambda T^{2}}{m}\sum_{j=1}^{m}\mathrm{KL}\Bigl(\pi_{T}\bigl(f(\widetilde{x}_{j}) + b(\widetilde{x}_{j})\bigr)\,\Big\|\,\pi_{T}\bigl(f(\widetilde{x}_{j})\bigr)\Bigr).
\end{aligned} \tag{22}$$

The first KL term is the teacher-based replacement for the source Bregman loss, and the second term is the target-side TILT penalty.
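As an illustration of how (22) could be evaluated, the following is a minimal dependency-free sketch. It takes precomputed logits as plain Python lists rather than networks; the function name `kl_tilt_loss`, its argument layout, and the toy values are assumptions made for this example and are not the authors' implementation.

```python
import math

def softmax_T(z, T):
    """Temperature-T softmax pi_T(z) = softmax(z / T), computed stably."""
    scaled = [t / T for t in z]
    m = max(scaled)
    e = [math.exp(t - m) for t in scaled]
    s = sum(e)
    return [t / s for t in e]

def kl(u, v):
    """KL(u || v) for probability vectors with v > 0."""
    return sum(a * math.log(a / c) for a, c in zip(u, v) if a > 0.0)

def kl_tilt_loss(f_src, b_src, tau_src, f_tgt, b_tgt, lam, T):
    """Empirical KL-TILT surrogate (22) from lists of per-example logit vectors."""
    n, m = len(f_src), len(f_tgt)
    source = 0.0
    for f_i, b_i, tau_i in zip(f_src, b_src, tau_src):
        corrected = [a + c for a, c in zip(f_i, b_i)]
        source += kl(softmax_T(corrected, T), softmax_T(tau_i, T))
    penalty = 0.0
    for f_j, b_j in zip(f_tgt, b_tgt):
        corrected = [a + c for a, c in zip(f_j, b_j)]
        penalty += kl(softmax_T(corrected, T), softmax_T(f_j, T))
    return (T ** 2 / n) * source + (lam * T ** 2 / m) * penalty

# Toy logits: two source examples, one target example, K = 3 classes.
f_src = [[0.5, 0.0, -0.5], [1.0, -1.0, 0.0]]
b_src = [[0.1, -0.1, 0.0], [0.0, 0.3, -0.3]]
tau_src = [[2.0, 0.0, -1.0], [0.5, 0.5, -1.0]]
f_tgt = [[0.2, 0.1, -0.3]]
b_tgt = [[0.4, 0.0, -0.4]]
print(kl_tilt_loss(f_src, b_src, tau_src, f_tgt, b_tgt, lam=10.0, T=2.0))
```

In this sketch the target penalty vanishes when `b_tgt` is identically zero, and the source term vanishes when `f + b` matches the teacher logits on every source example.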

##### Relation to KD-TILT.

The KD-TILT objective used in the experiments keeps the same target-side penalty as ([22](https://arxiv.org/html/2605.14280#A4.E22)), but replaces the source term by the usual distillation direction,

$$\mathrm{KL}\Bigl(\pi_{T}\bigl(\tau(x_{i})\bigr)\,\Big\|\,\pi_{T}\bigl(f(x_{i}) + b(x_{i})\bigr)\Bigr),$$

optionally mixed with the supervised cross-entropy loss, as in ([6](https://arxiv.org/html/2605.14280#S2.E6)). Thus KL-TILT is the most direct KL analogue of the Bregman-TILT construction, while KD-TILT is its standard knowledge-distillation surrogate.
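The two source terms differ only in the order of the KL arguments, which matters because the KL divergence is not symmetric. A small sketch with hypothetical logits follows; the function names are introduced here for illustration, only the bare KL terms are computed (without the $T^{2}$ prefactor), and the optional cross-entropy mixing from (6) is omitted.

```python
import math

def softmax_T(z, T):
    scaled = [t / T for t in z]
    m = max(scaled)
    e = [math.exp(t - m) for t in scaled]
    s = sum(e)
    return [t / s for t in e]

def kl(u, v):
    return sum(a * math.log(a / c) for a, c in zip(u, v) if a > 0.0)

def kl_tilt_source_term(f_i, b_i, tau_i, T):
    """Source term of (22): KL(student + correction || teacher)."""
    corrected = [a + c for a, c in zip(f_i, b_i)]
    return kl(softmax_T(corrected, T), softmax_T(tau_i, T))

def kd_tilt_source_term(f_i, b_i, tau_i, T):
    """Distillation direction: KL(teacher || student + correction)."""
    corrected = [a + c for a, c in zip(f_i, b_i)]
    return kl(softmax_T(tau_i, T), softmax_T(corrected, T))

# Hypothetical two-class example: a confident corrected student, a uniform teacher.
f_i, b_i, tau_i, T = [2.0, 0.0], [0.2, 0.0], [0.0, 0.0], 1.0
print(kl_tilt_source_term(f_i, b_i, tau_i, T))
print(kd_tilt_source_term(f_i, b_i, tau_i, T))
```

Both terms are zero when the corrected student logits reproduce the teacher distribution, so they share the same minimizers in that idealized case while weighting discrepancies differently away from it.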

## Appendix E Additional results from numerical studies

This appendix collects numerical results that complement the main experiments. The synthetic plots give diagnostics for aspects that are harder to read from the main figures: how the one-dimensional solution changes with $\lambda$, how the learned auxiliary component behaves, and how the relative sizes of $\mathscr{F}$ and $\mathscr{B}$ affect the fit. The CIFAR-100 plots report additional corruption types and the top-5 metric.

### E.1 Synthetic Regression Diagnostics

Effect of $\lambda$ in a misspecified one-dimensional problem: Setup. We use the same one-dimensional misspecified linear problem as in [Figure 2](https://arxiv.org/html/2605.14280#S4.F2) A–B. The target distribution is fixed at the Beta$(2,5)$ endpoint, while the source distribution is moved from the matched case toward the Beta$(5,2)$ endpoint. The deployed class $\mathscr{F}$ is a degree-three shifted Legendre linear model, and the auxiliary class $\mathscr{B}$ is a richer Gaussian-kernel ridge expansion. For source corruption levels $0$, $0.3$, $0.7$, and $1.0$, we sweep $\lambda$ and report target-test MSE for TILT and exact RuLSIF, with source ERM shown as a horizontal reference.

![Refer to caption](https://arxiv.org/html/2605.14280v1/x4.png)

Figure 5: Regularization sensitivity in the one-dimensional synthetic regression experiment. Each panel fixes a source corruption level and plots target-test MSE as a function of $\lambda$. Curves show means over trials and shaded bands show interquartile ranges. The TILT and exact RuLSIF curves have visibly different $\lambda$ dependence, and the favorable range for TILT changes with the source corruption level.

In this one-dimensional problem, the useful range of $\lambda$ shifts with the corruption level, and the curve for TILT behaves differently from exact RuLSIF. Large $\lambda$ moves the estimator toward the source-ERM reference, while small $\lambda$ gives the auxiliary component a weaker target penalty. The plot is therefore mainly a diagnostic of how the fitted solution changes with $\lambda$ in this particular misspecified setting.

Auxiliary behavior in the one-dimensional problem: Setup. This diagnostic inspects the learned auxiliary component $b$ in the same one-dimensional experiment. The diagnostics are computed at corruption levels $0$ and $1$ while sweeping $\lambda$, and they track how much energy $b$ carries on source and target covariates together with its interaction with the source residual structure.

![Refer to caption](https://arxiv.org/html/2605.14280v1/x5.png)

Figure 6: Auxiliary-component diagnostics for the synthetic regression experiment. For the same one-dimensional synthetic problem as in [Figure 2](https://arxiv.org/html/2605.14280#S4.F2), A shows diagnostic quantities for the learned auxiliary component $b$ at corruption level $0$. In the matched-domain case, all quantities involving $b$ remain small, and increasing $\lambda$ keeps the target-side contribution controlled. B repeats the diagnostics at corruption level $1$. Under the large shift, the source- and target-side quantities separate strongly, consistent with $b$ fitting source-specific residual structure while the target penalty limits its effect on the deployed component $f$.

[Figure 6](https://arxiv.org/html/2605.14280#A5.F6) complements the preceding $\lambda$ sweep. When the domains are nearly matched, the auxiliary component remains small. Under the larger shift, the diagnostics are consistent with $b$ carrying more source-specific residual variation, while the target-side penalty limits how much of that variation is inherited by the deployed predictor $f$.

Changing the auxiliary dimension in a bounded-ratio problem: Setup. To isolate the role of $\lambda$ and the complexity of $b$, we run a finite-linear diagnostic under a bounded density ratio. The source distribution is uniform on $[0,1]$, and the target distribution is obtained by a smooth raised-cosine density ratio with $0.13 \leq q(x)/p(x) \leq 7.8$. We use $320$ source samples, $320$ target covariates, Gaussian noise with standard deviation $0.08$, and average over $100$ trials. The deployed class $\mathscr{F}$ is a shifted Legendre basis of dimension $d_{f}$, while the auxiliary class $\mathscr{B}$ is a real Fourier basis of dimension $d_{b}$; the figure compares $(d_{f}, d_{b}) \in \{(20,8), (8,20), (8,8), (20,20)\}$.

![Refer to caption](https://arxiv.org/html/2605.14280v1/x6.png)

Figure 7: Finite-linear $\lambda$ sweep under bounded density ratio. The four panels reproduce the first four diagnostics from the dimension sweep, except that the second panel reports $(1+\lambda)\mathcal{E}_{\lambda}^{2}$ rather than the raw $\mathcal{E}_{\lambda}^{2}$. When $d_{f} = 20$, the task is effectively well specified for the deployed class. In this case, a smaller auxiliary class keeps the small-$\lambda$ solution closer to source ERM, whereas a rich auxiliary class can be worse at small $\lambda$. When $d_{f} = 8$, the deployed class is misspecified and a richer auxiliary class improves performance in this sweep.

[Figure 7](https://arxiv.org/html/2605.14280#A5.F7) keeps the density ratio bounded and varies the dimensions of $f$ and $b$. When $d_{f} = 20$, the deployed class is close to well specified; in this case a smaller auxiliary class keeps the small-$\lambda$ solution close to source ERM, whereas making both $f$ and $b$ rich can be worse at very small $\lambda$. When $d_{f} = 8$, the deployed class is misspecified and the richer auxiliary class is helpful. These finite-dimensional curves are consistent with the view that $\lambda$ controls how much auxiliary flexibility is used.

A lower-dimensional neural control: Setup. This lower-dimensional neural-network control uses covariates in $[0,1]^{16}$ with independent product-Beta source and target marginals: the target endpoint is Beta$(2,5)^{16}$ and the source endpoint is Beta$(5,2)^{16}$. The regression function is generated by a fixed ReLU teacher network, and we use $4096$ labeled source samples, $4096$ unlabeled target covariates, target-test sets of size $65536$, and $20$ trials. The deployed predictor $f$ is a one-hidden-layer ReLU network of width $16$; TILT may additionally use a deeper and wider auxiliary network $b$. Source ERM, exact IW, exact RuLSIF, kernel-estimated RuLSIF, and TILT are tuned over the same validation protocol used for the other synthetic experiments.

![Refer to caption](https://arxiv.org/html/2605.14280v1/x7.png)

Figure 8: Well-specified beta-product neural control. Target MSE is reported for a $16$-dimensional beta-product covariate-shift problem with a fixed ReLU-teacher regression function. Unlike the high-dimensional weak-class experiment in [Figure 2](https://arxiv.org/html/2605.14280#S4.F2), the deployed neural class is sufficiently rich for this lower-dimensional task. Source ERM and kernel-estimated RuLSIF therefore perform similarly to TILT in this setting.

[Figure 8](https://arxiv.org/html/2605.14280#A5.F8) is included as a control. In this lower-dimensional setting, the deployed neural class is already rich enough that source ERM and kernel-estimated RuLSIF remain competitive with TILT over much of the shift range. Thus this particular well-specified regime is not a stress test for TILT.

### E.2 Additional CIFAR-100 Results

[Figure 10](https://arxiv.org/html/2605.14280#A5.F10) shows that the single-corruption results are not uniform across corruption types. For Gaussian and defocus blur, KL-TILT overtakes vanilla KD once the shift is large enough. Contrast is milder in this sweep: the teacher remains accurate over most of the path, and the benefit of KL-TILT appears only around severity $1.5$. Pixelate shows little separation between KD and KL-TILT in this configuration. Thus these plots support the main CIFAR-100 trend for several corruptions, but they also make clear that the size of the gain depends on the corruption.

Metric robustness on the main CIFAR-100 path: Setup. We report the top-5 metric for the same mixed-corruption CIFAR-100 experiment used in [Figure 4](https://arxiv.org/html/2605.14280#S4.F4). The source is clean CIFAR-100; the target path jointly changes brightness, contrast, color, and Gaussian blur; the methods are source ERM, KD, KD-TILT, and KL-TILT with the same validation-selected hyperparameters and seed aggregation as in the main comparison.

![Refer to caption](https://arxiv.org/html/2605.14280v1/x8.png)

Figure 9: Top-5 accuracy on the CIFAR-100 covariate-shift experiment. This figure complements [Figure 4](https://arxiv.org/html/2605.14280#S4.F4) by reporting top-5 accuracy on the shifted target-test set for the same source ERM, KD, KD-TILT, and KL-TILT methods. Solid curves show means over seeds and shaded bands indicate $\pm 1$ standard deviation.

[Figure 9](https://arxiv.org/html/2605.14280#A5.F9) gives the corresponding top-5 view of the main CIFAR-100 path. The ordering is similar to the top-1 and loss plots: tilted distillation is competitive at small shifts and becomes better than source ERM and vanilla KD once the target corruption is larger.

CIFAR-100 behavior across individual corruption types: Setup. This experiment tests whether the CIFAR-100 behavior in [Figure 4](https://arxiv.org/html/2605.14280#S4.F4) depends on the particular mixed corruption path used in the main text. The source domain is clean CIFAR-100, with the same source-trained ResNet-20 teacher, weak CNN student $f$, and ResNet-20 auxiliary logit model $b$. On the target side, we apply one corruption type at a time: Gaussian blur, defocus blur, contrast, or pixelate, with severity in $\{0, 0.5, 1.0, 1.5, 2.0\}$. Brightness is omitted to keep the panel readable. For KL-TILT, we report the best validation-selected value over $\lambda \in \{10, 30, 100\}$, and all student curves average three seeds.

![Refer to caption](https://arxiv.org/html/2605.14280v1/x9.png)

![Refer to caption](https://arxiv.org/html/2605.14280v1/x10.png)

Figure 10: Target-test cross-entropy under single-corruption CIFAR-100 shifts. This figure repeats the CIFAR-100 distillation experiment while varying the target corruption type one at a time. The top grid visualizes the same CIFAR-100 test image under Gaussian blur, defocus blur, contrast, and pixelate corruptions; columns correspond to the severity values used in the sweep. The bottom panels report target-test cross-entropy. The teacher curve is the source-trained ResNet-20 evaluated directly on the shifted target-test set. The KD and KL-TILT curves show means over three seeds with $\pm 1$ standard-deviation bands; KL-TILT denotes the validation-selected Bregman-IIW/KL-TILT model.
