Representation Curriculum: Stagewise Training for Robust Ranking and Allocation
Summary
This paper proposes Representation Curriculum (RC), a training-time intervention that stages feature utilization to reduce over-reliance on exposure-confounded historical signals and improve cold-start generalization in ranking systems. The method is theoretically analyzed and validated on public benchmarks and large-scale eBay search experiments.
# Representation Curriculum: Stagewise Training for Robust Ranking and Allocation
Source: [https://arxiv.org/html/2606.09891](https://arxiv.org/html/2606.09891)
Sina Baharlouei and Abraham Bagherjeiran, eBay Search Ranking and Monetization, San Jose, California, USA
\(1 June 2026\)
###### Abstract\.
Ranking in digital marketplaces is a dynamic exposure\-allocation mechanism: displayed items shape discovery trajectories and success events, which are logged by the platform to update future allocation policies\. Modern ranking systems therefore rely heavily on endogenous, exposure\-confounded signals \(e\.g\., popularity estimates, CTR/CVR aggregates, and ID\-based representations\), because they are highly predictive under approximately stationary demand and explain substantial variance in logged outcomes\. Yet this predictive power can become a learning shortcut: early access to exposure\-dependent *belief* signals in training steers optimization dynamics toward over\-reliance on them and away from exposure\-independent *merit* signals \(e\.g\., content\-based competitiveness and semantic intent affinity estimates\)\. Consequently, the learned policy tends to entrench incumbents and degrade cold\-start generalization and robustness under distribution shift\.
We propose Representation Curriculum \(RC\), a semantics\-aware training\-time optimization\-trajectory intervention that temporally stages feature utilization\. RC foregrounds content\-based merit signals in the initial stage of training, then introduces exposure\-dependent historic belief signals while anchoring the content pathway to remain close to the learned merit representation, curbing shortcut reliance on historical signals and mitigating gradient starvation on content signals\. We formalize RC independently of task and hypothesis class and provide ranking\-specific instantiations\. In a Gaussian linear ridge setting, we derive closed\-form solutions and verifiable sufficient conditions under which RC strictly reduces population risk on a welfare\-aligned cold\-start target distribution, with a quantified Pareto tradeoff against source performance\. Experiments on public learning\-to\-rank and recommendation benchmarks, together with randomized online experiments in a large\-scale e\-commerce product search system, show that RC measurably shifts reliance from historical belief signals toward content\-based merit signals and yields consistent gains on cold populations with a controlled trade\-off in head performance\.
Learning to Rank; Curriculum Learning; Mechanism Design; Exploitation Bias; Cold Start
Published in: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\.2 \(KDD ’26\), August 09–13, 2026, Jeju Island, Republic of Korea\. DOI: 10\.1145/3770855\.3818470\. ISBN: 979\-8\-4007\-2259\-2/2026/08\. License: CC\. CCS concepts: Information systems → Learning to rank; Applied computing → Online shopping\.

## 1\.Introduction
Modern ranking systems in search and recommendation make allocation decisions that shape how consumers discover inventory, how suppliers receive exposure, and how the marketplace evolves over time\. Because these policies mediate outcomes for multiple stakeholders, they are expected to account for the incentives of suppliers, consumers, and the platform and generalize reliably across heterogeneous traffic segments\. This is especially challenging in dynamic marketplaces, with high novelty, heterogeneity, and churn on the supply side, and non\-stationary demand and heterogeneous intent on the consumer side\. It is therefore crucial to maintain tighter control on *how* *decision\-making primitives* are used to make welfare\-aligned allocations\. When rankers rely primarily on exposure\-shaped historical signals that are endogenous to the platform’s past allocations, they can exhibit self\-reinforcing generalization patterns that disproportionately favor incumbents and weaken discovery, fairness, and long\-run platform health\. Popularity\-driven generalization can lead to: \(i\) cold\-start fragility that harms discovery and new supplier on\-boarding, \(ii\) shifting the burden of discovery to users rather than relying on exposure\-independent content\-based merit signals, \(iii\) contextual intent misalignment when feedback is pooled across contexts to obtain statistically stable estimates, and \(iv\) self\-reinforcing exposure loops that entrench incumbents \(Chaney et al\., [2018](https://arxiv.org/html/2606.09891#bib.bib57)\)\.
Existing approaches mitigate this *popularity bias* \(Chen et al\., [2023](https://arxiv.org/html/2606.09891#bib.bib32); Abdollahpouri et al\., [2019](https://arxiv.org/html/2606.09891#bib.bib33)\) through: \(i\) correcting training\-data selection effects with counterfactual learning techniques such as propensity weighting; \(ii\) shaping supervision through distillation or reward modeling to encourage desired allocation behavior; \(iii\) shrinkage and uncertainty\-aware belief estimation \(e\.g\., empirical Bayes\); \(iv\) architectural constraints that limit history–content fusion with content pathways; and \(v\) stochastic masking/dropout as regularization techniques in training\. These methods are broadly complementary, but they are typically oblivious to data semantics and to *how* learning allocates capacity across feature groups over time\. We argue that early access to belief signals can dominate gradients and induce a stable shortcut solution that under\-learns content\-based merit\. We therefore propose a semantics\-aware training\-trajectory intervention: delay access to belief signals and protect the content pathway, enabling explicit tradeoffs between head performance on logged data and robustness on welfare\-aligned target segments\.
### 1\.1\.Prior art
*\(1\) Counterfactual and causal learning from endogenous logs\.* Counterfactual learning to rank \(LTR\) approaches account for biased feedback \(e\.g\., position bias\) in empirical risk minimization via propensity weighting \(Swaminathan and Joachims, [2015](https://arxiv.org/html/2606.09891#bib.bib19); Joachims et al\., [2017](https://arxiv.org/html/2606.09891#bib.bib17); Schnabel et al\., [2016](https://arxiv.org/html/2606.09891#bib.bib18)\)\. Structural causal models treat popularity as a confounder and intervene explicitly \(Zhang et al\., [2021](https://arxiv.org/html/2606.09891#bib.bib51); Wei et al\., [2021](https://arxiv.org/html/2606.09891#bib.bib50)\)\. The broader goal of robustness to distribution shift has been studied under invariant prediction \(Peters et al\., [2016](https://arxiv.org/html/2606.09891#bib.bib58)\), invariant risk minimization \(Arjovsky et al\., [2019](https://arxiv.org/html/2606.09891#bib.bib59)\), and anchor regression \(Rothenhäusler et al\., [2021](https://arxiv.org/html/2606.09891#bib.bib38)\)\. These works target identification and evaluation, but typically treat feature representations as given and do not directly control how the learner utilizes different feature groups\.
*\(2\) Supervision\-based behavior shaping\.* Platforms can encode preferences via distillation from a merit\-centric teacher \(Ebrahimzadeh et al\., [2024a](https://arxiv.org/html/2606.09891#bib.bib40); Hinton et al\., [2015](https://arxiv.org/html/2606.09891#bib.bib45)\), constrained optimization for exposure fairness \(Singh and Joachims, [2018](https://arxiv.org/html/2606.09891#bib.bib20), [2019](https://arxiv.org/html/2606.09891#bib.bib21); Singh et al\., [2021](https://arxiv.org/html/2606.09891#bib.bib22); Tang et al\., [2024](https://arxiv.org/html/2606.09891#bib.bib66)\), or reward shaping and policy imitation \(Chen et al\., [2022](https://arxiv.org/html/2606.09891#bib.bib64)\)\. These methods shape behavior through objectives, but do not necessarily prevent shortcut learning when belief signals dominate early optimization\.
*\(3\) History stabilization and shrinkage\.* Industrial systems use uncertainty\-aware belief estimates \(e\.g\., Empirical Bayes \(EB\) smoothing with global or content\-based priors\) to aggregate platform belief for LTR and product search \(Yang et al\., [2022](https://arxiv.org/html/2606.09891#bib.bib14), [2024](https://arxiv.org/html/2606.09891#bib.bib15); Han et al\., [2022](https://arxiv.org/html/2606.09891#bib.bib16); Ardywibowo et al\., [2025](https://arxiv.org/html/2606.09891#bib.bib67)\)\. EB improves the quality of exposure\-dependent signals via reduced variance and calibrated uncertainty, but it does not directly address the learning\-dynamics problem, and belief signals can still dominate learning\.
*\(4\) Architectural constraints and explicit decompositions\.* Multi\-tower architectures, gating, and limited fusion constrain how historic belief and content\-based merit representations combine \(Volkovs et al\., [2017](https://arxiv.org/html/2606.09891#bib.bib10); Wang et al\., [2018a](https://arxiv.org/html/2606.09891#bib.bib11)\)\. They can enforce interpretable policy structure but may under\-perform if explicit decomposition is too strict\.
*\(5\) Optimization\-trajectory interventions\.* Training dynamics influence the learned representations, and strong predictors can suppress gradient signal for weaker cues in neural networks \(Pezeshki et al\., [2021](https://arxiv.org/html/2606.09891#bib.bib9)\)\. Generic regularizers \(dropout\) and stochastic masking reduce reliance on any single feature group \(Srivastava et al\., [2014](https://arxiv.org/html/2606.09891#bib.bib36); Volkovs et al\., [2017](https://arxiv.org/html/2606.09891#bib.bib10); Wang et al\., [2018a](https://arxiv.org/html/2606.09891#bib.bib11)\), but are usually semantics\-agnostic and not welfare\-aligned\. Our approach differs from standard curriculum learning \(Ferro et al\., [2018](https://arxiv.org/html/2606.09891#bib.bib69); Zeng et al\., [2022](https://arxiv.org/html/2606.09891#bib.bib68)\) in that we shape the properties of the model by explicit anchoring to the content pathway\.
### 1\.2\.Contributions
#### A semantics\-aware representation curriculum for robust allocation:
We propose*Representation Curriculum*\(RC\), which partitions features into content\-based merit signals and exposure\-dependent historic belief signals and temporally stages the training\. Stage 1 learns a model trained only on content\-based signals\. Stage 2 introduces historic signals but anchors the optimization so that content\-based competence is preserved\. This feature\-access curriculum complements the standard approaches to mitigate over\-reliance on historic belief signals by controlling how the learner relies on them\.
#### Theoretical Guarantees with quantified trade\-offs and Verifiable Sufficient Conditions:
We analyze RC in linear ridge regression, providing verifiable conditions under which RC reduces population risk on a target distribution that emphasizes cold/strategic segments, while remaining competitive on the logged data distribution\. We extend the lens to boosted trees via base\-margin curricula and to neural ranking, where we connect pairwise/listwise saturation to feature starvation and provide practical diagnostics\.
#### Experimental Evidence on Public Benchmarks and Deployed Systems:
On MSLR\-WEB, we identify highly predictive historic belief based features that encode behavioral evidence, and show how RC reduces feature importance on these historic belief signals and improves cold\-item ranking quality with minimal impact on overall performance\. On MovieLens, RC shapes neural two\-tower recommenders toward stronger content reliance and yields stronger frozen\-start performance in the absence of historic signals\. Finally, an online A/B test in a major e\-commerce sponsored search system establishes that a policy trained via RC increases exposure and sale velocity for new listings with neutral aggregate KPIs, situating RC as an effective behavior shaping technique in large scale systems\.
## 2\.Problem Setup: Robust Policy Updates under Endogenous History Signals
We formalize ranking as an allocation mechanism with two semantically distinct classes of features: *exposure\-independent, content\-based merit signals* and *exposure\-dependent historical belief signals*\. At iteration $t$, a deployed policy $\pi_t$ maps a context $c$ \(query, session, user intent state\) and a candidate set to a ranked slate \(or a distribution over rankings\)\. Exposure induces interactions, and a data sampling policy produces a logged dataset $\mathcal{D}_t$, which is, in turn, used to train a new policy $\pi_{t+1}$; that is,
$$\pi_t \;\longrightarrow\; \mathcal{D}_t \sim \mathcal{P}(\pi_t) \;\longrightarrow\; \pi_{t+1}, \tag{1}$$

where $\mathcal{P}(\pi_t)$ denotes the data distribution induced by deployment \(including selection effects and feedback\)\. A key challenge is that $\mathcal{P}(\pi_t)$ is *endogenous*: which contexts and items appear, and which outcomes are observed, are shaped by the deployed policy\. Platforms rarely optimize solely for the population represented in logged data\. Instead, they choose a *target distribution* $\mathcal{Q}$ that reflects strategic or welfare\-aligned goals \(e\.g\., improving long\-tail coverage, supporting new supply, prioritizing high\-value intents, or meeting fairness constraints\)\. Our aim is to train the next policy so that it generalizes to a platform\-chosen target population and does not over\-rely on endogenous belief features\. If each update is robust to the $\mathcal{P}(\pi_t)$–$\mathcal{Q}$ mismatch, then repeated deployment is less likely to amplify feedback loops\. Over time, this helps maintain alignment with long\-run platform objectives\.
Each training example is a tuple $\mathbf{z} = (c, i, y, \mathbf{x})$, where $c$ is a context, $i$ is an item, $y$ is a label \(e\.g\., relevance, click, conversion\), and $\mathbf{x}$ is a feature vector\. We assume the features decompose into two semantic groups, $\mathbf{x} \equiv (\mathbf{x}_C, \mathbf{x}_H)$, where
- $\mathbf{x}_C$ are content\-based signals, meaning features available regardless of prior exposure \(e\.g\., query–item affinity, item attributes, price and logistics signals, content embeddings\)\.
- $\mathbf{x}_H$ are exposure\-dependent historical belief signals, meaning features derived from past exposure and interactions \(e\.g\., stabilized CTR/CVR estimates, historical transactions, item IDs, or history\-derived embeddings\)\.
Historical belief signals $\mathbf{x}_H$ are powerful because they compactly summarize accumulated interaction evidence under past allocations\. However, they are also *policy\-mediated*: their distribution depends on the platform’s prior exposure decisions\. As a result, for strategic segments such as cold\-start or under\-exposed inventory, $\mathbf{x}_H$ can differ substantially under the target population $\mathcal{Q}$, which upweights such segments, relative to the logging distribution $\mathcal{P}(\pi_t)$\.
### 2\.1\.Over\-Reliance on Endogenous Signals
To motivate semantics\-aware control over learning dynamics, we first illustrate the pathology in a minimal setting\. Consider a linear predictor $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}_C^\top \mathbf{x}_C + \mathbf{w}_H^\top \mathbf{x}_H$ trained to minimize expected loss under the logged distribution $\mathcal{P}$ \(induced by deployment\)\. Suppose $\mathbf{x}_H$ is strongly predictive under $\mathcal{P}$ because it aggregates exposure\-conditioned evidence and explains variance in selected logged contexts, while $\mathbf{x}_C$ is moderately predictive but broadly available\. If the learning algorithm observes $(\mathbf{x}_C, \mathbf{x}_H)$ jointly from the start, it will typically assign large weight to $\mathbf{x}_H$ because it yields the fastest reduction in empirical loss\. This can lead to *over\-reliance*:

$$\|\mathbf{w}_H\| \text{ becomes large while } \|\mathbf{w}_C\| \text{ remains small.}$$

Now consider a target population $\mathcal{Q}$ that deviates from $\mathcal{P}$, for example by upweighting the cold\-start items\. In these segments, $\mathbf{x}_H$ is a weak predictor, so predictive performance depends disproportionately on the learned content pathway $\mathbf{w}_C$\. Thus, a model that is highly accurate under $\mathcal{P}$ can generalize poorly under $\mathcal{Q}$\.
This effect goes beyond mere “feature missingness” in the target population and reflects a broader shift in the joint distribution of $(c, i, y)$: the platform strategically cares about success events for items and contexts that are underrepresented under $\mathcal{P}(\pi_t)$\.
The same phenomenon appears sharply in pairwise learning\-to\-rank\. Let $s_{\boldsymbol{\theta}}(c, i)$ be a scoring function and consider the standard pairwise logistic loss used in LambdaLoss\-style training \(Wang et al\., [2018b](https://arxiv.org/html/2606.09891#bib.bib63)\):

$$\ell(\boldsymbol{\theta}; c, i, j) \;=\; \log\Big(1 + \exp\big(-(s_{\boldsymbol{\theta}}(c, i) - s_{\boldsymbol{\theta}}(c, j))\big)\Big). \tag{2}$$

For brevity, let $s_i := s_{\boldsymbol{\theta}}(c, i)$ and $s_j := s_{\boldsymbol{\theta}}(c, j)$\. The gradient magnitude with respect to the score difference is proportional to $\sigma\big(-(s_i - s_j)\big)$, where $\sigma$ is the sigmoid\.
If exposure\-dependent belief signals $\mathbf{x}_H$ provide a highly predictive shortcut, early training can quickly create large score gaps $|s_i - s_j|$ by exploiting $\mathbf{x}_H$\. Then $\sigma\big(-(s_i - s_j)\big)$ saturates toward $0$, and gradients for other parameters are correspondingly downweighted, including those that control content representations\. As a result, content signals $\mathbf{x}_C$ may be *starved of learning signal* even if they are predictive and essential for strategic generalization\.
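As a small numeric illustration of the saturation argument above, the following Python sketch evaluates the pairwise logistic loss in (2) and the factor $\sigma(-(s_i - s_j))$ for a correctly ordered pair as the score gap grows. The function names are ours and are not part of the paper.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def pairwise_logistic_loss(s_i: float, s_j: float) -> float:
    """Eq. (2): log(1 + exp(-(s_i - s_j))), where item i should outrank item j."""
    d = s_i - s_j
    # Stable form of log(1 + exp(-d)) for large |d|.
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

def grad_scale(s_i: float, s_j: float) -> float:
    """Magnitude of d(loss)/d(s_i - s_j), which equals sigmoid(-(s_i - s_j))."""
    return sigmoid(-(s_i - s_j))

# As the gap widens, the factor that scales every parameter gradient shrinks.
for gap in (0.0, 2.0, 5.0, 10.0):
    loss = pairwise_logistic_loss(gap, 0.0)
    scale = grad_scale(gap, 0.0)
    print(f"gap={gap:5.1f}  loss={loss:.6f}  grad_scale={scale:.6f}")
```

Because the same factor multiplies the gradient of every parameter that feeds the score difference, a gap created through $\mathbf{x}_H$ also shrinks the updates reaching the content pathway.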
This starvation mechanism is not tied to a particular hypothesis class\. It follows from loss geometry \(saturation\) and path dependence: once one feature group achieves separation early, the optimizer has little incentive to develop alternative separating representations\. While *gradient starvation* is discussed in the context of over\-parameterized neural networks under gradient descent updates \(Pezeshki et al\., [2021](https://arxiv.org/html/2606.09891#bib.bib9)\), we use the term more broadly here, with a particular focus on the semantics\-driven case of historical belief versus content\-based merit signals in ranking and prediction\.
### 2\.2\.Formal objective and problem statement
Let $\mathcal{F}$ be a hypothesis class and let $f \in \mathcal{F}$ be a predictor, which induces a score\-based allocation policy $\pi_f = \mathrm{argsort}(f)$\. Let $\mathcal{L}(f; \mathbf{z})$ be the task loss \(e\.g\., MSE, cross\-entropy, pairwise or listwise ranking loss\)\. For any distribution $\mathcal{D}$ over examples $\mathbf{z}$, define the population risk

$$\mathcal{R}_{\mathcal{D}}(f) \;\equiv\; \mathbb{E}_{\mathbf{z} \sim \mathcal{D}}\!\left[\mathcal{L}(f; \mathbf{z})\right].$$

We observe data $\mathcal{D}_t \sim \mathcal{P}(\pi_t)$ but aim to minimize risk under a platform\-chosen target distribution $\mathcal{Q}$\. Let $f^{\textsc{Full}}$ denote the model trained by standard joint empirical risk minimization \(ERM\) on all features $(\mathbf{x}_C, \mathbf{x}_H)$, and let $f^{\textsc{Content}}$ denote the content\-only model trained with $\mathbf{x}_H$ masked\. Our goal is to design a training procedure that produces a predictor $f$ such that:
1. Target robustness: $\mathcal{R}_{\mathcal{Q}}(f)$ improves relative to $f^{\textsc{Full}}$, especially on strategic subpopulations \(e\.g\., cold\-start or under\-exposed inventory\)\.
2. Source competitiveness: $\mathcal{R}_{\mathcal{P}(\pi_t)}(f)$ remains competitive with $f^{\textsc{Full}}$ and does not collapse to $f^{\textsc{Content}}$\.
3. Behavioral shaping / fairness: $f$ reduces excessive dependence on $\mathbf{x}_H$, mitigating self\-reinforcing exposure disparities in the induced allocation policy, while preserving overall marketplace KPIs\.
A clean and widely applicable instantiation is the *frozen\-start* target, where belief signals are unavailable at inference time:

$$\mathbf{x}_H \equiv 0 \quad \text{(or set to a default/prior) under } \mathcal{Q}. \tag{3}$$

This yields a controlled notion of distribution shift that isolates the central question: whether the learner has developed a predictive content pathway\. Although our motivating setting is ranking in dynamic marketplaces, the same problem arises in prediction tasks whenever a feature group is \(i\) highly predictive under the training distribution and \(ii\) unstable, missing, or strategically de\-emphasized under the target distribution\. Our method applies to both ranking and pointwise prediction objectives\.
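The frozen-start target in (3) can be emulated on held-out logged data by overwriting $\mathbf{x}_H$ with a default before scoring. The sketch below is a minimal illustration under our own assumptions: squared loss, labels kept as logged, a default of zero, and two hypothetical hand-set linear predictors that fit the logged data equally well.

```python
from typing import Callable, List, Sequence, Tuple

# One example is (x_C, x_H, y); this layout is an assumption made for illustration.
Example = Tuple[Sequence[float], Sequence[float], float]
Predictor = Callable[[Sequence[float], Sequence[float]], float]

def mse_risk(f: Predictor, data: List[Example],
             freeze_history: bool = False, default: float = 0.0) -> float:
    """Empirical squared-error risk; optionally replace x_H by a default (frozen start)."""
    total = 0.0
    for x_c, x_h, y in data:
        if freeze_history:
            x_h = [default] * len(x_h)
        total += (f(x_c, x_h) - y) ** 2
    return total / len(data)

# Logged data in which x_H happens to track x_C perfectly, with y = x_C + x_H.
data: List[Example] = [([1.0], [1.0], 2.0), ([2.0], [2.0], 4.0), ([-1.0], [-1.0], -2.0)]

def history_heavy(x_c, x_h):
    return 0.0 * x_c[0] + 2.0 * x_h[0]

def content_heavy(x_c, x_h):
    return 1.5 * x_c[0] + 0.5 * x_h[0]

for name, f in (("history_heavy", history_heavy), ("content_heavy", content_heavy)):
    print(name, mse_risk(f, data), mse_risk(f, data, freeze_history=True))
```

Both predictors reach zero risk on the logged data, but under the frozen-start evaluation the history-heavy one degrades far more, which is exactly the gap that the frozen-start target is meant to expose.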
## 3\.Methodology: Representation Curriculum
We propose *representation curriculum* \(RC\), a semantics\-aware training\-time intervention that controls reliance on exposure\-dependent historical belief signals via *temporal staging*\. RC is motivated by the observation that when $\mathbf{x}_H$ is highly predictive under the logged distribution, standard training can over\-invest in $\mathbf{x}_H$ early in optimization\. This yields a weak content pathway and poor robustness under distribution shift in the target population\.
#### Feature staging via a gating schedule\.
RC introduces a *feature\-gating schedule* $m(t) \in [0, 1]$ over training iterations $t$, defining the gated input

$$\tilde{\mathbf{x}}_t \;\equiv\; (\mathbf{x}_C,\; m(t)\,\mathbf{x}_H). \tag{4}$$

While the schedule can be gradual, we focus on a practical two\-stage curriculum:

$$m(t) = 0 \;\;\text{for}\;\; t \leq T_1 \qquad \text{and} \qquad m(t) = 1 \;\;\text{for}\;\; t > T_1,$$

that is, Stage 1 trains on content signals only, then Stage 2 trains on all signals\. Stage 1 produces a *content anchor model* that must explain label variation using $\mathbf{x}_C$, and therefore learns content representations that are more robust to distribution shift\.
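The gating schedule and gated input in (4) are simple to express in code. The sketch below implements the two-stage schedule as stated, plus a linear ramp as one hypothetical example of a "gradual" schedule; the ramp and its `ramp` parameter are our assumptions, since the text does not specify a gradual form.

```python
from typing import List, Sequence, Tuple

def two_stage_gate(t: int, t1: int) -> float:
    """Two-stage schedule: m(t) = 0 for t <= T1 and m(t) = 1 for t > T1."""
    return 0.0 if t <= t1 else 1.0

def linear_ramp_gate(t: int, t1: int, ramp: int) -> float:
    """Hypothetical gradual variant: 0 up to T1, then a linear ramp to 1 over `ramp` steps."""
    if t <= t1:
        return 0.0
    return min(1.0, (t - t1) / ramp)

def gate_input(x_c: Sequence[float], x_h: Sequence[float],
               m: float) -> Tuple[List[float], List[float]]:
    """Eq. (4): the gated input (x_C, m(t) * x_H)."""
    return list(x_c), [m * v for v in x_h]

# Content features pass through unchanged; history features are scaled by m(t).
for t in (1, 3, 4, 6, 10):
    m_hard = two_stage_gate(t, t1=3)
    m_soft = linear_ramp_gate(t, t1=3, ramp=4)
    print(t, m_hard, m_soft, gate_input([0.2], [5.0], m_soft))
```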
#### Anchored Stage 2 for a robust content pathway
Let $f_{\boldsymbol{\theta}}$ denote the model class and let $\boldsymbol{\theta}_C$ denote the subset of parameters implementing the *content pathway* \(e\.g\., a content tower in a two\-tower neural model, or the Stage 1 component of an additive model\)\. Let $\boldsymbol{\theta}^{(1)}$ be the parameters obtained after Stage 1\. Stage 2 optimizes the task loss on full features while anchoring to the Stage 1 content pathway in one \(or both\) of the following ways\.
\(i\) Prediction anchoring \(masked consistency\)\. We enforce that the Stage 2 model remains consistent with the Stage 1 model on *masked inputs*:

$$\mathcal{L}_{\text{pred}}(\boldsymbol{\theta}) \;=\; \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{P}(\pi_t)}\Big[d\Big(f_{\boldsymbol{\theta}}(\mathbf{x}_C, 0),\, f_{\boldsymbol{\theta}^{(1)}}(\mathbf{x}_C, 0)\Big)\Big], \tag{5}$$

where $d(\cdot, \cdot)$ is a discrepancy in prediction space\. For regression we use squared error; for binary classification we use KL or squared logit differences; for ranking we use squared score differences on query–item pairs \(or KL on pairwise probabilities\), consistent with the downstream surrogate loss\.
\(ii\) Parameter anchoring \(content\-path protection\)\. We explicitly protect the content pathway parameters:

$$\mathcal{L}_{\text{par}}(\boldsymbol{\theta}) \;=\; \|\boldsymbol{\theta}_C - \boldsymbol{\theta}_C^{(1)}\|_2^2. \tag{6}$$

This is useful when function\-space consistency alone is insufficient \(e\.g\., when $\mathbf{x}_H$ induces strong gradients that substantially alter internal representations while keeping masked outputs approximately unchanged\)\.
Prediction anchoring preserves function behavior under masked belief inputs, while parameter anchoring preserves internal content representations\.
Stage 2 objective\. Let $\mathcal{L}(\cdot\,;\,y)$ be the primary supervised loss \(MSE, log\-loss, pairwise/listwise ranking surrogate\)\. RC trains Stage 2 with

$$\mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{P}_2}\!\left[\mathcal{L}\!\left(f_{\boldsymbol{\theta}}(\mathbf{x}_C, \mathbf{x}_H); y\right)\right] + \lambda\,\Omega(\boldsymbol{\theta}) + \mu_1\,\mathcal{L}_{\text{pred}}(\boldsymbol{\theta}) + \mu_2\,\mathcal{L}_{\text{par}}(\boldsymbol{\theta}), \tag{7}$$

where $\Omega$ is standard regularization \(weight decay, tree constraints, early stopping\), and $(\mu_1, \mu_2)$ control anchoring strength\. By default we take $\mathcal{P}_2 = \mathcal{P}(\pi_t)$ \(the logged distribution\), but RC naturally supports stage\-specific data choices: Stage 1 can be trained on a broader proxy $\mathcal{P}_1$ better aligned with learning content affinity \(e\.g\., broader traffic\), while Stage 2 can use richer contexts where belief signals are informative \(e\.g\., conversion\-heavy traffic\)\.
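To make the terms of (7) concrete, the sketch below evaluates the Stage 2 objective for a linear model $f(\mathbf{x}_C, \mathbf{x}_H) = \mathbf{w}_C^\top \mathbf{x}_C + \mathbf{w}_H^\top \mathbf{x}_H$. The specific choices are assumptions for illustration: squared error as the task loss and as the discrepancy $d$, an L2 penalty for $\Omega$, and $\boldsymbol{\theta}_C = \mathbf{w}_C$.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], Sequence[float], float]  # (x_C, x_H, y)

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(u * v for u, v in zip(a, b))

def stage2_objective(w_c, w_h, w_c_stage1, data: List[Example],
                     lam: float, mu1: float, mu2: float) -> float:
    """Eq. (7) for a linear model with squared loss (illustrative assumptions)."""
    n = len(data)
    # Task loss on the full, unmasked features.
    task = sum((dot(w_c, x_c) + dot(w_h, x_h) - y) ** 2 for x_c, x_h, y in data) / n
    # Omega: plain L2 penalty on all weights.
    reg = dot(w_c, w_c) + dot(w_h, w_h)
    # Eq. (5): masked predictions of the current model vs. the Stage 1 model.
    l_pred = sum((dot(w_c, x_c) - dot(w_c_stage1, x_c)) ** 2 for x_c, _, _ in data) / n
    # Eq. (6): squared distance of content weights from their Stage 1 values.
    l_par = sum((a - b) ** 2 for a, b in zip(w_c, w_c_stage1))
    return task + lam * reg + mu1 * l_pred + mu2 * l_par

data = [([1.0], [1.0], 2.0), ([2.0], [0.0], 2.0)]
value = stage2_objective([1.0], [1.0], [1.5], data, lam=0.1, mu1=2.0, mu2=4.0)
print(value)
```

In this toy case the task loss is zero, so the whole objective value comes from regularization and from the two anchors penalizing the drift of $\mathbf{w}_C$ from its Stage 1 value.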
Algorithm 1: Representation Curriculum \(RC\)

Require: Stage\-1 data $\mathcal{D}_1 \sim \mathcal{P}_1$, Stage\-2 data $\mathcal{D}_2 \sim \mathcal{P}_2$; features $\mathbf{x} = (\mathbf{x}_C, \mathbf{x}_H)$; loss $\mathcal{L}$; regularizer $\Omega$; anchors $(\mu_1, \mu_2)$\.

1. Stage 1 \(content\-only\): train $f_{\boldsymbol{\theta}^{(1)}}$ on $\{(\mathbf{x}_C, 0), y\} \subset \mathcal{D}_1$ by minimizing $\mathbb{E}[\mathcal{L}(f_{\boldsymbol{\theta}}(\mathbf{x}_C, 0); y)] + \lambda\,\Omega(\boldsymbol{\theta})$\.
2. Initialize Stage 2 at $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}^{(1)}$\.
3. Stage 2 \(full \+ anchors\): train $f_{\boldsymbol{\theta}}$ on $\{(\mathbf{x}_C, \mathbf{x}_H), y\} \subset \mathcal{D}_2$ by minimizing \([7](https://arxiv.org/html/2606.09891#S3.E7)\)\.
4. return $f_{\boldsymbol{\theta}}$\.
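The steps of Algorithm 1 can be sketched end to end for a linear model trained by full-batch gradient descent. Everything specific here is an assumption for illustration rather than the paper's setup: the squared loss, the L2 regularizer, the synthetic data generator, the hyperparameter values, and the use of the same data for both stages.

```python
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def train_rc(data, d_c, d_h, lam=1e-3, mu1=1.0, mu2=5.0,
             lr=0.05, steps1=300, steps2=400):
    """Two-stage RC for a linear model with squared loss (illustrative only)."""
    n = len(data)
    w_c, w_h = [0.0] * d_c, [0.0] * d_h

    # Stage 1: content-only fit; x_H is masked, so only w_c is updated.
    for _ in range(steps1):
        g_c = [2 * lam * w for w in w_c]
        for x_c, _x_h, y in data:
            r = dot(w_c, x_c) - y
            for k in range(d_c):
                g_c[k] += 2 * r * x_c[k] / n
        w_c = [w - lr * g for w, g in zip(w_c, g_c)]
    anchor = list(w_c)  # Stage 1 content weights, kept fixed as the anchor

    # Stage 2: start from the Stage 1 weights, unmask x_H, add both anchors.
    for _ in range(steps2):
        g_c = [2 * lam * w + 2 * mu2 * (w - a) for w, a in zip(w_c, anchor)]
        g_h = [2 * lam * w for w in w_h]
        for x_c, x_h, y in data:
            r = dot(w_c, x_c) + dot(w_h, x_h) - y   # task residual
            p = dot(w_c, x_c) - dot(anchor, x_c)    # masked-consistency residual
            for k in range(d_c):
                g_c[k] += (2 * r + 2 * mu1 * p) * x_c[k] / n
            for k in range(d_h):
                g_h[k] += 2 * r * x_h[k] / n
        w_c = [w - lr * g for w, g in zip(w_c, g_c)]
        w_h = [w - lr * g for w, g in zip(w_h, g_h)]
    return anchor, w_c, w_h

# Synthetic logged data: x_H is correlated with x_C and both drive the label.
rng = random.Random(0)
data = []
for _ in range(100):
    xc = rng.gauss(0.0, 1.0)
    xh = 0.8 * xc + 0.6 * rng.gauss(0.0, 1.0)
    data.append(([xc], [xh], xc + xh + 0.1 * rng.gauss(0.0, 1.0)))

anchor, wc_rc, wh_rc = train_rc(data, 1, 1)
_, wc_free, wh_free = train_rc(data, 1, 1, mu1=0.0, mu2=0.0)  # Stage 2 without anchors
drift_rc = abs(wc_rc[0] - anchor[0])
drift_free = abs(wc_free[0] - anchor[0])
print("content-weight drift with anchors:", drift_rc, "without:", drift_free)
```

With the anchors switched off, Stage 2 is free to move the content weight away from its Stage 1 value; with them on, the content weight stays close to the anchor while the history weight still picks up the remaining signal.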
#### Key instantiations\.
RC is model\-agnostic; the main design choice is what constitutes the *content pathway* $\boldsymbol{\theta}_C$ and how anchoring is implemented\.
Neural models \(two\-tower \+ MLP head\)\. In our recommender experiments, $f_{\boldsymbol{\theta}}$ is a two\-tower model: a user/context tower produces $u_{\phi}$, an item\-content tower produces $v_{\psi}(\mathbf{x}_C)$, and the final score is computed by an MLP head on the concatenation \(and optionally explicit interactions\),

$$s_{\boldsymbol{\theta}}(\mathbf{x}) = g_{\omega}\!\left(\big[u_{\phi},\; v_{\psi}(\mathbf{x}_C),\; u_{\phi} \odot v_{\psi}(\mathbf{x}_C),\; \varphi(\mathbf{x}_H)\big]\right),$$

where $\varphi(\mathbf{x}_H)$ embeds/scales belief features\. We define $\boldsymbol{\theta}_C$ as the parameters of the content tower \(and, optionally, the content\-dependent blocks of the head\), so $\mathcal{L}_{\text{par}}$ directly protects content representations\. Prediction anchoring $\mathcal{L}_{\text{pred}}$ is computed under masked belief input $(\mathbf{x}_C, 0)$\. Practically, stability in Stage 2 is improved by initializing from $\boldsymbol{\theta}^{(1)}$ and using conservative optimization \(e\.g\., not increasing the learning rate, optionally lowering it for $\boldsymbol{\theta}_C$\)\.
Gradient boosting\. For Gradient Boosted Decision Trees \(GBDTs\), RC uses an integer $M$ as the stage boundary\. Stage 1 fits $M$ trees using only content features, yielding

$$f^{(1)}(\mathbf{x}_C) = f_0 + \sum_{m=1}^{M} \eta\, h_m(\mathbf{x}_C).$$

Stage 2 continues boosting from this *fixed* content pathway by using $f^{(1)}$ as an initial score / base margin and fitting the remaining trees using all features:

$$f^{\textsc{RC}}(\mathbf{x}_C, \mathbf{x}_H) = f^{(1)}(\mathbf{x}_C) + \sum_{m=M+1}^{T} \eta\, h_m(\mathbf{x}_C, \mathbf{x}_H).$$

This realizes an especially strong form of parameter anchoring: the Stage 1 trees are never modified, so the content pathway cannot be overwritten\. The effective deviation introduced by belief features is controlled by $T - M$ \(remaining capacity\), shrinkage $\eta$, and early stopping\. When supported by the training framework, an additional masked\-consistency term analogous to \([5](https://arxiv.org/html/2606.09891#S3.E5)\) can be implemented by adding masked replicas of training instances with a pseudo\-target equal to $f^{(1)}(\mathbf{x}_C)$; in our main GBDT experiments, tuning $M$ together with standard boosting regularization is sufficient\.
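The base-margin curriculum above can be illustrated with a tiny boosting loop over depth-1 trees (stumps) and squared loss. This is a self-contained toy, not the paper's training setup: the stump learner, the data, and the values of $M$, $T$, and $\eta$ are assumptions chosen to keep the example short.

```python
def fit_stump(X, residuals, features):
    """Best single-split regression stump over the allowed feature indices."""
    best = None
    for j in features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:  # no valid split: fall back to a constant stump
        mean = sum(residuals) / len(residuals)
        return (features[0], float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, row):
    j, thr, ml, mr = stump
    return ml if row[j] <= thr else mr

def boost(X, y, features, n_trees, eta, base):
    """Fit n_trees stumps on the allowed features, starting from base scores."""
    scores, trees = list(base), []
    for _ in range(n_trees):
        residuals = [yi - si for yi, si in zip(y, scores)]
        stump = fit_stump(X, residuals, features)
        trees.append(stump)
        scores = [s + eta * predict_stump(stump, row) for s, row in zip(scores, X)]
    return trees, scores

def mse(y, scores):
    return sum((a - b) ** 2 for a, b in zip(y, scores)) / len(y)

# Column 0 is a content feature, column 1 a history feature (toy data).
X = [[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [4.0, 3.0], [5.0, 2.0]]
y = [0.5 * row[0] + row[1] for row in X]
M, T, eta = 3, 8, 0.5
f0 = sum(y) / len(y)

stage1_trees, s1 = boost(X, y, [0], M, eta, [f0] * len(y))   # content-only trees
stage2_trees, s2 = boost(X, y, [0, 1], T - M, eta, s1)       # base margin = Stage 1 scores

def predict_rc(row):
    return f0 + eta * sum(predict_stump(t, row) for t in stage1_trees + stage2_trees)

print("MSE after Stage 1:", mse(y, s1), "after Stage 2:", mse(y, s2))
```

Because Stage 2 only appends trees to the Stage 1 scores, the content-only trees are left untouched, which mirrors the "never modified" property described in the text. In a production library the same pattern would typically be expressed through an initial-score or base-margin input rather than a hand-written loop.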
#### Relation to stochastic feature masking\.
Stochastic feature masking / dropout \(Srivastava et al\., [2014](https://arxiv.org/html/2606.09891#bib.bib36)\) randomly zeros subsets of inputs throughout training\. While it can reduce reliance on any single feature, it is *semantics\-oblivious* and does not ensure a robust content pathway is learned *before* exposure\-dependent shortcuts are introduced\. In contrast, RC explicitly stages access to $\mathbf{x}_H$ and then anchors the model to preserve content behavior and/or content parameters\. This targeted control is central to improving robustness on strategic populations where historic belief signals are not reliable predictors\.
#### Evaluation and reliance diagnostics\.
RC aims to reduce over\-reliance on exposure\-dependent belief signals while maintaining overall utility\. Accordingly, we report \(i\) standard in\-distribution metrics on held\-out splits from $\mathcal{P}(\pi_t)$, \(ii\) target robustness metrics on $\mathcal{Q}$, notably *frozen\-start* evaluation where belief signals are set to defaults, and \(iii\) reliance diagnostics such as feature importance in trees, sensitivity/attribution measures in neural models, and popularity/coverage metrics in ranking \(Section [5](https://arxiv.org/html/2606.09891#S5)\)\.
## 4\.Theoretical Guarantees
We formalize Representation Curriculum in a linear-Gaussian setting with ridge regularization. This yields closed-form characterizations that make the “shape the update” perspective precise and quantify the tradeoff between performance on the logged distribution $\mathcal{P}(\pi_t)$ and a strategic target $\mathcal{Q}$. Let $\mathbf{x}=(\mathbf{x}_C,\mathbf{x}_H)\in\mathbb{R}^{d_C+d_H}$ be a zero-mean feature vector. We consider the population model
$$y=\mathbf{x}_C^\top\beta_C+\mathbf{x}_H^\top\beta_H+\varepsilon,\qquad \mathbb{E}[\varepsilon\mid\mathbf{x}]=0,\quad \mathbb{E}[\varepsilon^2]=\sigma^2.$$
Let $\Sigma_{\mathcal{D}}=\mathbb{E}_{\mathcal{D}}[\mathbf{x}\mathbf{x}^\top]$ denote the covariance under distribution $\mathcal{D}$, and similarly $\Sigma_{\mathcal{D},CC}=\mathbb{E}_{\mathcal{D}}[\mathbf{x}_C\mathbf{x}_C^\top]$. We compare three estimators trained on the logged distribution $\mathcal{P}$:
#### \(i\) Full ridge \(no curriculum\)\.
$$\mathbf{w}^{\textsc{Full}}=\arg\min_{\mathbf{w}_C,\mathbf{w}_H}\ \mathbb{E}_{\mathcal{P}}\!\left[(y-\mathbf{x}_C^\top\mathbf{w}_C-\mathbf{x}_H^\top\mathbf{w}_H)^2\right]+\lambda\left(\|\mathbf{w}_C\|_2^2+\|\mathbf{w}_H\|_2^2\right). \tag{8}$$
#### \(ii\) Content\-only ridge \(Stage 1\)\.
$$\mathbf{w}_C^{(1)}=\arg\min_{\mathbf{w}_C}\ \mathbb{E}_{\mathcal{P}}\!\left[\big(y-\mathbf{x}_C^\top\mathbf{w}_C\big)^2\right]+\lambda\|\mathbf{w}_C\|_2^2,\qquad \mathbf{w}^{\textsc{Content}}\equiv(\mathbf{w}_C^{(1)},0). \tag{9}$$
#### \(iii\) Representation curriculum \(anchored ridge, Stage 2\)\.
We model Stage 2 as ridge with a*content\-parameter anchor*:
$$\mathbf{w}^{\textsc{RC}}(\mu)=\arg\min_{\mathbf{w}_C,\mathbf{w}_H}\ \mathbb{E}_{\mathcal{P}}\!\left[\big(y-\mathbf{x}_C^\top\mathbf{w}_C-\mathbf{x}_H^\top\mathbf{w}_H\big)^2\right]+\lambda\|\mathbf{w}_C\|_2^2+\lambda\|\mathbf{w}_H\|_2^2+\mu\|\mathbf{w}_C-\mathbf{w}_C^{(1)}\|_2^2. \tag{10}$$
This is the population analogue of Stage 2 parameter anchoring in ([7](https://arxiv.org/html/2606.09891#S3.E7)). Prediction anchoring yields closely related bounds; we defer that variant to Appendix [A](https://arxiv.org/html/2606.09891#A1).
### 4\.1\.Closed\-form characterization
Write the ridge normal equations under𝒫\\mathcal\{P\}using block covariance matrices:
$$A_{CC}=\Sigma_{\mathcal{P},CC}+\lambda I,\quad A_{HH}=\Sigma_{\mathcal{P},HH}+\lambda I,\quad A_{CH}=\Sigma_{\mathcal{P},CH},\quad A_{HC}=A_{CH}^\top.$$
Define the Schur complement
$$A_0\equiv A_{CC}-A_{CH}A_{HH}^{-1}A_{HC}. \tag{11}$$
We assume $A_0\succ 0$ with eigenvalues in $[a_{\min},a_{\max}]$. This condition fails only when content features are perfectly redundant once history is present (e.g., perfect collinearity).
###### Lemma 1 (Content-path interpolation).
Define the matrix $\mathbf{w}_\mu=\mu(A_0+\mu I)^{-1}$. Let $\mathbf{w}_C^{\textsc{Full}}$ denote the content block of $\mathbf{w}^{\textsc{Full}}$ and $\mathbf{w}_C^{(1)}$ denote the Stage 1 solution. Then the content block of the RC solution satisfies
$$\mathbf{w}_C^{\textsc{RC}}(\mu)=\mathbf{w}_C^{\textsc{Full}}+\mu(A_0+\mu I)^{-1}\big(\mathbf{w}_C^{(1)}-\mathbf{w}_C^{\textsc{Full}}\big), \tag{12}$$
and the history block satisfies
$$\mathbf{w}_H^{\textsc{RC}}(\mu)=\mathbf{w}_H^{\textsc{Full}}-A_{HH}^{-1}A_{HC}\big(\mathbf{w}_C^{\textsc{RC}}(\mu)-\mathbf{w}_C^{\textsc{Full}}\big). \tag{13}$$
Moreover, $\mathbf{w}_\mu$ has eigenvalues $\frac{\mu}{a_i+\mu}\in[0,1)$, and thus
$$\big\|\mathbf{w}_C^{\textsc{RC}}(\mu)-\mathbf{w}_C^{\textsc{Full}}\big\|\le\frac{\mu}{a_{\min}+\mu}\,\|\Delta_C\|,\qquad \big\|\mathbf{w}_C^{\textsc{RC}}(\mu)-\mathbf{w}_C^{(1)}\big\|\le\frac{a_{\max}}{a_{\max}+\mu}\,\|\Delta_C\|, \tag{14}$$
where $\Delta_C\equiv\mathbf{w}_C^{(1)}-\mathbf{w}_C^{\textsc{Full}}$.
Lemma [1](https://arxiv.org/html/2606.09891#S4.Thmtheorem1) makes explicit that $\mu$ traces a continuous path between the full-model content coefficients ($\mu=0$) and the content-only coefficients ($\mu\to\infty$), modulated by the conditioning of the Schur complement $A_0$.
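The interpolation formulas can be checked numerically. The sketch below, assuming a randomly generated population covariance and coefficient vector (both hypothetical stand-ins for $\Sigma_{\mathcal{P}}$ and $\beta$), solves the anchored-ridge normal equations of (10) directly and compares the result with the closed forms (12) and (13).

```python
import numpy as np

rng = np.random.default_rng(0)
dC, dH, lam, mu = 3, 2, 0.5, 2.0

# Hypothetical population quantities: a random PSD covariance and true coefficients.
G = rng.normal(size=(dC + dH, dC + dH))
Sigma = G @ G.T / (dC + dH)
beta = rng.normal(size=dC + dH)
b = Sigma @ beta                       # E[x y] under the linear model

A = Sigma + lam * np.eye(dC + dH)      # ridge normal-equation matrix
A_CC, A_CH = A[:dC, :dC], A[:dC, dC:]
A_HC, A_HH = A[dC:, :dC], A[dC:, dC:]

w_full = np.linalg.solve(A, b)         # full ridge, Eq. (8)
w1 = np.linalg.solve(A_CC, b[:dC])     # Stage 1 content-only ridge, Eq. (9)

# Direct solve of anchored ridge, Eq. (10): the anchor adds mu*I to the
# content block and mu*w1 to the content part of the right-hand side.
A_rc, rhs = A.copy(), b.copy()
A_rc[:dC, :dC] += mu * np.eye(dC)
rhs[:dC] += mu * w1
w_rc = np.linalg.solve(A_rc, rhs)

# Closed forms (12) and (13) via the Schur complement A0.
A0 = A_CC - A_CH @ np.linalg.solve(A_HH, A_HC)
W_mu = mu * np.linalg.inv(A0 + mu * np.eye(dC))
delta_C = w1 - w_full[:dC]
wC_closed = w_full[:dC] + W_mu @ delta_C
wH_closed = w_full[dC:] - np.linalg.solve(A_HH, A_HC @ (wC_closed - w_full[:dC]))

print(np.allclose(w_rc[:dC], wC_closed), np.allclose(w_rc[dC:], wH_closed))
```

Re-running with `mu = 0.0` returns the full-ridge solution, and very large `mu` drives the content block toward the Stage 1 solution, matching the two endpoints of the path.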
We evaluate models on two distributions: (i) the logged source $\mathcal{P}(\pi_t)$, and (ii) a strategic target $\mathcal{Q}$ that assigns higher mass to cold/under-exposed inventory. To isolate the core cold-start shift in a clean form, we analyze the canonical target where
$$\mathbf{x}_H\equiv 0\quad\text{a.s. under }\mathcal{Q}, \tag{15}$$
i.e., historical belief signals are absent/unreliable in the target population. This corresponds to “frozen-start” evaluation in our neural experiments and to the sparse-history segment in MSLR.
For squared loss under $\mathcal{D}$, the excess risk is $\mathcal{E}_{\mathcal{D}}(\mathbf{w})=\mathbb{E}_{\mathcal{D}}[(\mathbf{x}^\top(\mathbf{w}-\beta))^2]=(\mathbf{w}-\beta)^\top\Sigma_{\mathcal{D}}(\mathbf{w}-\beta).$
### 4\.2\.Main theorem: Pareto tradeoff
###### Theorem 2 \(Anchored ridge yields a quantified Pareto tradeoff\)\.
Assume $A_0\succ 0$ with eigenvalues in $[a_{\min},a_{\max}]$ and consider the cold target ([15](https://arxiv.org/html/2606.09891#S4.E15)). Let $\Delta_C=\mathbf{w}_C^{(1)}-\mathbf{w}_C^{\textsc{Full}}$. Define the source and target curvature constants
$$L_{\mathcal{P}}\equiv\lambda_{\max}(\Sigma_{\mathcal{P}}),\qquad L_{\mathcal{Q}}\equiv\lambda_{\max}(\Sigma_{\mathcal{Q},CC}),$$
and the history-coupling factor $\kappa_H\equiv 1+\|A_{HH}^{-1}A_{HC}\|_{\text{op}}^2.$ Then for all $\mu\geq 0$ the following hold:
#### (A) Bounded regret to Full on the source $\mathcal{P}$.
$$\big|\mathcal{E}_{\mathcal{P}}(\mathbf{w}^{\textsc{RC}}(\mu))-\mathcal{E}_{\mathcal{P}}(\mathbf{w}^{\textsc{Full}})\big|\le L_{\mathcal{P}}\,\kappa_H\left(\frac{\mu}{a_{\min}+\mu}\right)^2\|\Delta_C\|^2. \tag{16}$$
#### (B) Bounded regret to Content on the target $\mathcal{Q}$.
Since $\mathbf{x}_H\equiv 0$ under $\mathcal{Q}$, only the content block matters and
$$\big|\mathcal{E}_{\mathcal{Q}}(\mathbf{w}^{\textsc{RC}}(\mu))-\mathcal{E}_{\mathcal{Q}}(\mathbf{w}^{\textsc{Content}})\big|\le L_{\mathcal{Q}}\left(\frac{a_{\max}}{a_{\max}+\mu}\right)^2\|\Delta_C\|^2. \tag{17}$$
#### (C) Improvement over Full on $\mathcal{Q}$ for large enough $\mu$.
If the content-only model is better than the full model on the target by margin $m_{\mathcal{Q}}\equiv\mathcal{E}_{\mathcal{Q}}(\mathbf{w}^{\textsc{Full}})-\mathcal{E}_{\mathcal{Q}}(\mathbf{w}^{\textsc{Content}})>0$, then any $\mu$ satisfying
$$L_{\mathcal{Q}}\left(\frac{a_{\max}}{a_{\max}+\mu}\right)^2\|\Delta_C\|^2\le m_{\mathcal{Q}} \tag{18}$$
guarantees $\mathcal{E}_{\mathcal{Q}}(\mathbf{w}^{\textsc{RC}}(\mu))\le\mathcal{E}_{\mathcal{Q}}(\mathbf{w}^{\textsc{Full}})$.
#### (D) Improvement over Content on $\mathcal{P}$ for small enough $\mu$.
If the full model is better than the content-only model on the source by margin $m_{\mathcal{P}}\equiv\mathcal{E}_{\mathcal{P}}(\mathbf{w}^{\textsc{Content}})-\mathcal{E}_{\mathcal{P}}(\mathbf{w}^{\textsc{Full}})>0$, then any $\mu$ satisfying
$$L_{\mathcal{P}}\,\kappa_H\left(\frac{\mu}{a_{\min}+\mu}\right)^2\|\Delta_C\|^2\le m_{\mathcal{P}} \tag{19}$$
guarantees $\mathcal{E}_{\mathcal{P}}(\mathbf{w}^{\textsc{RC}}(\mu))\le\mathcal{E}_{\mathcal{P}}(\mathbf{w}^{\textsc{Content}})$.
#### Interpretation \(how to read the constants\)\.
The theorem formalizes a *Pareto tradeoff* between source and target risk as $\mu$ varies. The conditioning of the Schur complement $A_0$ (via $a_{\min},a_{\max}$) governs how quickly the RC solution moves from Full ($\mu=0$) toward Content ($\mu\to\infty$). If content and history are perfectly collinear under $\mathcal{P}$, then $A_0\to 0$ and no curriculum can uniquely recover content effects. The factor $\kappa_H$ captures how strongly history coefficients respond to changes in content coefficients (via cross-covariance between $\mathbf{x}_C$ and $\mathbf{x}_H$). Finally, the margins $m_{\mathcal{P}},m_{\mathcal{Q}}$ encode the empirical tension: Full typically wins on $\mathcal{P}$ (history is predictive), while Content can win on $\mathcal{Q}$ (history is absent/unreliable).
###### Corollary 3 (An explicit $\mu$-interval and a Pareto curve).
Let $\Delta\equiv\|\Delta_C\|$. If $m_{\mathcal{Q}}>0$ and $m_{\mathcal{P}}>0$, then any $\mu$ satisfying both ([18](https://arxiv.org/html/2606.09891#S4.E18)) and ([19](https://arxiv.org/html/2606.09891#S4.E19)) simultaneously improves over Full on $\mathcal{Q}$ and over Content on $\mathcal{P}$. Equivalently, solving the two conditions for $\mu$, the feasible interval is
$$\max\Big\{0,\ a_{\max}\Big(\Delta\sqrt{L_{\mathcal{Q}}/m_{\mathcal{Q}}}-1\Big)\Big\}\ \le\ \mu\ \le\ \frac{a_{\min}\sqrt{m_{\mathcal{P}}}}{\Delta\sqrt{L_{\mathcal{P}}\kappa_H}-\sqrt{m_{\mathcal{P}}}}, \tag{20}$$
where the upper bound is read as $+\infty$ when $\Delta\sqrt{L_{\mathcal{P}}\kappa_H}\le\sqrt{m_{\mathcal{P}}}$, and the interval is nonempty only when the lower bound does not exceed the upper bound. Moreover, ([16](https://arxiv.org/html/2606.09891#S4.E16))–([17](https://arxiv.org/html/2606.09891#S4.E17)) quantify the Pareto curve: as $\mu$ increases, the target regret bound decays as $\big(\frac{a_{\max}}{a_{\max}+\mu}\big)^2$, while the source regret bound grows as $\big(\frac{\mu}{a_{\min}+\mu}\big)^2$.
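Conditions (18) and (19) are each monotone in $\mu$, so they can be solved for $\mu$ in closed form. A minimal sketch of that computation follows; the function name `feasible_mu_interval` and the numeric inputs are hypothetical, and in practice the constants would have to be estimated from data.

```python
import math

def feasible_mu_interval(delta, a_min, a_max, L_P, L_Q, kappa_H, m_P, m_Q):
    """Return (mu_lo, mu_hi) satisfying both (18) and (19), or None if empty.

    Assumes delta > 0 and positive margins m_P, m_Q.
    """
    # (18): a_max / (a_max + mu) <= sqrt(m_Q / L_Q) / delta  =>  lower bound on mu
    mu_lo = max(0.0, a_max * (delta * math.sqrt(L_Q / m_Q) - 1.0))
    # (19): mu / (a_min + mu) <= r  =>  upper bound on mu (none if r >= 1)
    r = math.sqrt(m_P / (L_P * kappa_H)) / delta
    mu_hi = math.inf if r >= 1.0 else a_min * r / (1.0 - r)
    return (mu_lo, mu_hi) if mu_lo <= mu_hi else None

# Illustrative constants only (not taken from the experiments).
interval = feasible_mu_interval(
    delta=1.0, a_min=1.0, a_max=2.0, L_P=1.0, L_Q=1.0,
    kappa_H=1.0, m_P=0.25, m_Q=0.64,
)
print(interval)
```

A `None` result means the two sufficient conditions cannot be met together for the given constants; it does not by itself rule out a simultaneous improvement, since (18) and (19) are only sufficient.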
Assumption ([15](https://arxiv.org/html/2606.09891#S4.E15)) captures the cleanest cold-start shift (belief signals absent). More generally, the target $\mathcal{Q}$ may differ from $\mathcal{P}$ in the joint distribution of $(\text{context},\text{item},y)$ (e.g., the platform strategically upweights contexts where cold inventory is valuable), and $\mathbf{x}_H$ may be present but less correlated with $y$ than under $\mathcal{P}$. The same analysis extends by replacing $\Sigma_{\mathcal{Q},CC}$ and the “$\mathbf{x}_H\equiv 0$” simplification with the appropriate covariance structure; we provide the general form and proof details in Appendix [A](https://arxiv.org/html/2606.09891#A1).
We additionally derive a conservative risk bound for GBDTs in Appendix[B](https://arxiv.org/html/2606.09891#A2), showing that RC inherits the Stage 1 anchor’s target performance up to a correction term controlled by Stage 2 capacity, yielding a verifiable sufficient condition for improving over a belief\-dominant baseline on𝒬\\mathcal\{Q\}\.
## 5\.Experiments
We evaluate RC as a mechanism for controlling reliance on exposure\-dependent historical belief signals in learning\-to\-rank \(LTR\) and response prediction\. We consider three representative experimental settings: \(i\) a public LTR benchmark with explicit behavioral features \(MSLR\-WEB10K\), \(ii\) a public recommendation benchmark with endogenous interaction aggregates \(MovieLens\-20M\), and \(iii\) a large\-scale production A/B test in sponsored product search\. Across all settings, we test the same core hypothesis:*delaying access to historical belief signals during training improves robustness on cold and strategic segments while preserving strong overall performance\.*
### 5\.1\.MSLR\-WEB10K: RC for LTR
MSLR-WEB10K is a large-scale supervised learning-to-rank benchmark released by Microsoft Research (Qin and Liu, [2010](https://arxiv.org/html/2606.09891#bib.bib13), [2013](https://arxiv.org/html/2606.09891#bib.bib39)). Each instance is a query–URL pair with graded relevance in $\{0,1,2,3,4\}$ and a 136-dimensional feature vector; we follow the official five-fold splits.
#### Content\-based merit vs\. historical belief signals\.
A notable aspect of MSLR-WEB10K is that it includes explicit behavioral features alongside query–document match features. Following the dataset documentation (Qin and Liu, [2010](https://arxiv.org/html/2606.09891#bib.bib13)), we treat features 1–133 as exposure-independent *content-based merit* features and the last three features (134–136) as exposure-dependent *historical belief* features derived from user interactions:
$$\mathbf{x}_{\mathrm{cont}}\equiv(x_1,\dots,x_{133}),\qquad \mathbf{x}_{\mathrm{hist}}\equiv(x_{134},x_{135},x_{136}). \tag{21}$$
#### Cold vs\. warm query–URL pairs\.
Because MSLR-WEB10K does not provide impression logs, we operationalize “cold” using sparsity in behavioral features. In our main experiments, we define a query–URL pair as cold if $x_{136}=0$ (zero dwell time), which yields a large cold segment: across the five folds, 74–75% of test instances are cold.
#### Models and curriculum\.
We train LightGBM LambdaMART models with an NDCG objective using 2000 trees, 63 leaves, learning rate 0.05, and row/feature subsampling of 0.8. For GBDTs, RC is implemented as a two-stage boosting schedule: we first fit $M$ trees using merit-only features (masking $\mathbf{x}_{\mathrm{hist}}$), then fit the remaining $2000-M$ trees using all features. We sweep $M\in\{0,100,\dots,2000\}$ and report overall and cold NDCG@3 and AUC; results for other ranking metrics (NDCG@1, MAP@1, MAP@3) follow the same pattern.
#### Experimental protocol\.
All MSLR-WEB10K results are averaged over the official five folds with 200 independent repeats per fold (1000 runs total); we report means and use significance testing at the 0.01 level.
Figure 1. MSLR: overall NDCG@3 vs. curriculum depth $M$.
Figure 2. MSLR: cold NDCG@3 vs. curriculum depth $M$.
#### Results and interpretation\.
RC consistently reduces reliance on belief features as $M$ increases; gain-based importance trends are reported in Figure [3](https://arxiv.org/html/2606.09891#S5.F3). The NDCG curves exhibit the two-distribution tradeoff predicted by Theorem [2](https://arxiv.org/html/2606.09891#S4.Thmtheorem2). On cold instances (our empirical proxy for $\mathcal{Q}$, where belief signals are absent or attenuated), content-only training ($M=2000$) outperforms the full model ($M=0$) on average across folds and repeats, and intermediate curricula further improve cold NDCG (Figure [2](https://arxiv.org/html/2606.09891#S5.F2)). On the overall test distribution (proxy for $\mathcal{P}$), the full model outperforms the content-only baseline, and overall performance declines as $M\to 2000$ (Figure [1](https://arxiv.org/html/2606.09891#S5.F1)). Thus $M$ acts as a robustness–accuracy knob: compared to the full model, RC improves cold-start ranking while keeping overall performance close; compared to content-only training, RC preserves much higher overall quality while retaining most of the cold-start gains.
#### Reliance reduction via feature importance\.
Figure [3](https://arxiv.org/html/2606.09891#S5.F3) tracks the gain-based importance of the historical belief features F134–F136 as $M$ increases. In the full model ($M=0$), F134 is by far the most important feature. Under RC its rank drops from 1 at $M=0$ to 4 at $M=100$, 8 at $M=400$, and 14 at $M=1000$, with the same monotone trend holding for F135 and F136.
Figure 3. MSLR: gain-based importance of historical belief features F134–F136 vs. $M$ (mean $\pm$ 1 std over 1000 runs).
#### Comparison with column subsampling \(feature masking\)\.
Table [1](https://arxiv.org/html/2606.09891#S5.T1) shows that varying colsample_bytree $\in\{0.6,0.7,0.8,0.9\}$ leaves the dominant feature gain in the 36k–44k range and moves cold NDCG@3 only marginally (0.4049–0.4053), while overall NDCG@3 stays essentially at the full-model level (0.4455–0.4456). RC reduces the same feature’s gain to 10,025 ($M=100$) and 4,667 ($M=1000$), lifting cold NDCG@3 to 0.4061 and 0.4076 respectively. The gap indicates that semantics-aware staging is the key driver: column subsampling (feature masking) does not distinguish content-based from history-based features and therefore cannot target shortcut reliance.
Table 1\.RC vs\. column subsampling \(feature masking\) on MSLR\-WEB10K\. RC achieves substantially stronger reliance reduction and cold\-start gains\.
#### AUC
Figures [4](https://arxiv.org/html/2606.09891#S5.F4) and [5](https://arxiv.org/html/2606.09891#S5.F5) report overall and cold AUC. The general trend observed for NDCG holds for AUC as well: Representation Curriculum (RC) preserves strong overall AUC while improving cold AUC relative to the full model. Since the underlying LambdaMART base model is optimized for NDCG via a pairwise loss, AUC exhibits larger variance; accordingly, the full model can achieve higher AUC than the content-only model even on cold instances, which does not contradict our main findings.
Figure 4. MSLR: overall AUC vs. curriculum depth $M$.
Figure 5. MSLR: cold AUC vs. curriculum depth $M$.
### 5\.2\.MovieLens\-20M: RC in a neural recommendation setting
We next evaluate representation curriculum on a public recommendation benchmark with rich *item content* and naturally evolving *historical interaction evidence*. We use MovieLens-20M (timestamped ratings) and treat each interaction event $(u,i,t,r)$ as a supervised example. While MovieLens is not collected under a controlled deployment policy, it provides a convenient testbed to (i) instantiate historical belief signals as *endogenous, exposure-dependent aggregates* and (ii) study how curriculum changes reliance on such aggregates under controlled “cold” interventions.
#### Task and split\.
We use a *time-based split* to respect the causality of history features: interactions are sorted by timestamp, with the last segment held out for test and a preceding segment for validation. All history features for an event are computed strictly from interactions *prior* to its timestamp. We focus on classification with $y=\mathbb{1}\{r\geq 4\}$ (“like”), which is standard for implicit-feedback style evaluation and supports AUC and log-loss diagnostics.
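The “strictly prior” requirement can be made concrete with a single pass over time-sorted events. The sketch below is an illustration under assumptions, not the released pipeline: the function name `prior_item_counts` and the tuple layout are hypothetical, and events sharing a timestamp are treated as simultaneous so that none of them sees the others.

```python
from collections import defaultdict

def prior_item_counts(events, like_threshold=4.0):
    """Attach per-item counts computed only from strictly earlier events.

    events: iterable of (user, item, timestamp, rating).
    Returns rows (user, item, timestamp, label, n_prior, likes_prior) in time order.
    """
    events = sorted(events, key=lambda e: e[2])
    n, likes = defaultdict(int), defaultdict(int)
    rows, i = [], 0
    while i < len(events):
        # Collect all events that share the current timestamp.
        j = i
        while j < len(events) and events[j][2] == events[i][2]:
            j += 1
        batch = events[i:j]
        # Emit features first, using state from earlier timestamps only...
        for user, item, t, rating in batch:
            label = int(rating >= like_threshold)
            rows.append((user, item, t, label, n[item], likes[item]))
        # ...then update the running counts with this batch.
        for user, item, t, rating in batch:
            n[item] += 1
            likes[item] += int(rating >= like_threshold)
        i = j
    return rows

rows = prior_item_counts([
    ("u1", "a", 1, 5.0),
    ("u2", "a", 2, 3.0),
    ("u3", "a", 2, 4.0),
    ("u4", "a", 3, 1.0),
])
```

Emitting features before updating the counts is what prevents an event's own label from leaking into its history features.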
#### Content\-based merit signals \(exposure\-independent\)\.
We define item merit features from content text (movie title and genres). Our main configuration uses sentence embeddings, specifically Sentence-BERT (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.09891#bib.bib70)) (all-MiniLM-L6-v2), reduced to dimension $d$ and concatenated with lightweight metadata (genres as tokens). We also considered representations based on TF-IDF and mean pooling of token embeddings. On the user side, we consider two representations: (i) “id” uses a user ID embedding; (ii) “id+profile” augments the ID embedding with a *train-only profile* computed by mean-pooling the content embeddings of items the user interacted with in training. The profile captures stable preference structure without relying on item-level exposure aggregates at inference time.
#### Historical belief signals \(exposure\-dependent\)\.
We build item-level belief features using *empirical-Bayes stabilized* estimates from interactions prior to time $t$. For each event $(u,i,t)$, we form an EB posterior for the Bernoulli label (like/dislike) based on prior outcomes for item $i$ and extract: the posterior mean (belief), a clipped log-count (evidence strength), and an uncertainty proxy (e.g., posterior variance). We use a content-based prior (nearest-neighbor on item content embeddings) with pseudo-count $\alpha$ (e.g., $\alpha=10$), mirroring how industrial systems stabilize sparse belief signals with content-conditioned priors.
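One way to realize these three features is a Beta-Bernoulli update. The sketch below assumes a conjugate Beta prior whose mean is the content-based prior and whose total pseudo-count is $\alpha$; the function name `eb_belief_features` and the clipping level `max_log_count` are hypothetical choices, and the exact estimator used in the experiments may differ.

```python
import numpy as np

def eb_belief_features(likes, n, prior_mean, alpha=10.0, max_log_count=5.0):
    """Belief, evidence-strength, and uncertainty features per event.

    likes, n: prior like count and prior interaction count for the item.
    prior_mean: content-based prior probability of a like (assumed given).
    """
    likes = np.asarray(likes, dtype=float)
    n = np.asarray(n, dtype=float)
    prior_mean = np.asarray(prior_mean, dtype=float)
    # Beta posterior parameters: prior pseudo-counts plus observed outcomes.
    a = alpha * prior_mean + likes
    b = alpha * (1.0 - prior_mean) + (n - likes)
    belief = a / (a + b)                                  # posterior mean
    evidence = np.minimum(np.log1p(n), max_log_count)     # clipped log-count
    variance = a * b / ((a + b) ** 2 * (a + b + 1.0))     # Beta posterior variance
    return np.stack([belief, evidence, variance], axis=-1)

# A cold item (no evidence) and a warm item (8 likes out of 10).
feats = eb_belief_features(likes=[0, 8], n=[0, 10], prior_mean=[0.3, 0.5])
```

With no evidence the belief collapses to the content-based prior, which is the behavior the frozen-prior evaluation relies on.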
#### Model class\.
We use the two-tower architecture with an MLP head, discussed in Section [3](https://arxiv.org/html/2606.09891#S3). Historical belief signals are injected through a dedicated scalar pathway (“history head”) that modulates the final logit. This decomposition aligns with our semantic split: merit pathways are available for cold items, while belief pathways are only meaningful when sufficient interaction evidence exists.
#### Representation curriculum and anchoring\.
We train in two stages. Stage 1 learns a merit-only model by masking historical belief signals, including item-ID-specific components that do not generalize to unseen items. Stage 2 introduces belief signals and trains the full model while *anchoring* to the Stage 1 solution. We implement two complementary anchors: (i) a prediction-consistency term (distillation) that keeps the Stage 2 model close to the Stage 1 outputs under merit-only inputs, and (ii) a parameter anchor on the merit pathway parameters (towers and merit-related MLP blocks) to prevent Stage 2 from washing out learned merit structure. We tie anchoring strengths and sweep $\mu_1,\mu_2\in\{0,0.01,0.1\}$.
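The shape of the Stage 2 objective can be sketched on a toy model. The example below assumes a linear logit $\mathbf{x}_C^\top\theta_C+\mathbf{x}_H^\top\theta_H$ instead of the two-tower network, treats masking as setting $\mathbf{x}_H$ to zero, and uses squared differences for both anchors; these are illustrative assumptions, and the exact forms of the anchoring terms in Section 3 may differ. The name `stage2_loss` is hypothetical.

```python
import numpy as np

def stage2_loss(theta_C, theta_H, theta_C1, X_C, X_H, y, mu1, mu2):
    """Task loss plus prediction-consistency (mu1) and parameter (mu2) anchors.

    theta_C1 holds the frozen Stage 1 content parameters.
    """
    logit = X_C @ theta_C + X_H @ theta_H
    # Numerically stable binary cross-entropy on logits.
    bce = np.mean(np.logaddexp(0.0, logit) - y * logit)
    # Prediction anchor: with belief inputs masked (zeroed), the Stage 2
    # output should stay close to the Stage 1 output.
    masked_gap = X_C @ theta_C - X_C @ theta_C1
    pred_anchor = np.mean(masked_gap ** 2)
    # Parameter anchor: keep content parameters near their Stage 1 values.
    param_anchor = np.sum((theta_C - theta_C1) ** 2)
    return bce + mu1 * pred_anchor + mu2 * param_anchor

rng = np.random.default_rng(0)
X_C, X_H = rng.normal(size=(50, 3)), rng.normal(size=(50, 2))
y = (rng.random(50) < 0.5).astype(float)
theta_C1 = rng.normal(size=3)
theta_H = rng.normal(size=2)

at_anchor = stage2_loss(theta_C1, theta_H, theta_C1, X_C, X_H, y, 0.1, 0.1)
no_anchor = stage2_loss(theta_C1, theta_H, theta_C1, X_C, X_H, y, 0.0, 0.0)
```

Both penalties vanish when the content parameters equal their Stage 1 values, so the anchors only cost something once Stage 2 starts moving the merit pathway.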
#### Evaluations
MovieLens admits multiple “cold” notions (e.g., low training interaction count or low event-time evidence), but these slices can be brittle because missing ratings are not true non-exposures. To make cold-start generalization *causally interpretable* and robust, we report frozen-start evaluations that explicitly remove belief features at inference time: *(a)* frozen-zero: set belief features to their default/zero values and replace unseen item-ID components with OOV; *(b)* frozen-prior: set belief features to their EB prior values (no evidence). These interventions isolate the contribution of learned merit structure from belief shortcuts. We report overall and frozen-start AUC as the key evaluation metrics.
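Frozen-start evaluation amounts to overwriting the belief inputs before scoring. A minimal sketch follows, assuming a scoring function that takes content and belief arrays separately; `frozen_start_auc`, `score_fn`, and the toy data are hypothetical, and AUC is computed from the Mann-Whitney rank statistic so that the example has no dependencies beyond NumPy.

```python
import numpy as np

def auc(y_true, scores):
    """AUC via the Mann-Whitney rank statistic, with average ranks for ties."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average 1-based rank
        i = j + 1
    n_pos = int((y_true == 1).sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def frozen_start_auc(score_fn, X_C, X_H, y, h_default):
    """Score with belief features replaced by a default (zero or EB prior)."""
    X_H_frozen = np.broadcast_to(np.asarray(h_default, dtype=float), X_H.shape)
    return auc(y, score_fn(X_C, X_H_frozen))

# Toy model that leans on a misleading history feature.
score_fn = lambda XC, XH: XC[:, 0] + 10.0 * XH[:, 0]
X_C = np.array([[-1.0], [-0.5], [0.5], [1.0]])
X_H = np.array([[5.0], [4.0], [0.0], [0.0]])
y = np.array([0, 0, 1, 1])

full_auc = auc(y, score_fn(X_C, X_H))
frozen_auc = frozen_start_auc(score_fn, X_C, X_H, y, h_default=0.0)
```

Passing the EB prior values as `h_default` instead of zero gives the frozen-prior variant.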
#### Results summary\.
Given consistent observations across multiple configurations, we report results from a representative setting and defer the full report on ablations to subsequent reproducibility studies. The Full model achieves the best in-distribution test AUC (0.678) but collapses under frozen start, dropping to AUC 0.580 when belief features are removed at inference. A two-stage curriculum *without* anchoring already improves robustness via a warm-start effect: initializing Stage 2 from the merit-only anchor (RC with $\mu_1=\mu_2=0$) raises frozen-start AUC to 0.620 (with test AUC 0.669). Prediction anchoring is the main knob that makes frozen-start behavior match the content model: RC with $\mu_1=0.1$ and $\mu_2=0$ reaches frozen-start AUC 0.661, essentially matching the content-only baseline (AUC 0.662), while keeping test AUC at 0.664. This is the intended behavior-shaping effect: Stage 2 leverages historical belief signals where available, while preserving a high-quality merit pathway that remains predictive when belief is absent.
### 5\.3\.Production A/B Test: Sponsored Product Search in a Dynamic Marketplace
We evaluate representation curriculum (RC) in a production sponsored product search system at a major e-commerce marketplace. Following the marketplace-evaluation abstraction of Ebrahimzadeh et al. ([2024b](https://arxiv.org/html/2606.09891#bib.bib41)), we view the ranker as an *allocation mechanism* that assigns exposure to items, and measure exposure and utility metrics that are observable via short-horizon randomized experiments. Our goal is to test the paper’s central claim: *semantics-aware control of the optimization trajectory*, implemented by delaying access to a highly predictive exposure-dependent signal, reduces over-reliance on historic belief signals, increases exposure for items with limited historical evidence, and improves outcomes on strategic cold inventory segments, while keeping aggregate marketplace KPIs neutral.
#### Policy, baseline, and treatment\.
The baseline is a standard “full” ranker trained on the complete feature set, including both exposure\-independent*merit*signals and exposure\-dependent*belief*signals \(history\- and ID\-derived signals\), with no curriculum\. The RC treatment uses the same training data, objective, and model capacity as the baseline, but alters the training*trajectory*by masking a single highly predictive*ID\-based belief feature*during the first training stage\. This feature acts as a strong shortcut because it summarizes item\-specific historic exposure and observed transaction evidence and collapses to a prior when an item has no historical purchases; therefore, it disproportionately advantages items with established exposure and conversions\. In the second stage, we unmask this feature and continue training with the full feature set\. \(For confidentiality, we omit platform\-specific details such as the exact feature definition and thresholds\.\)
#### Offline diagnostics \(behavioral effect of RC\)
Before online experimentation, we validate that RC changes feature reliance in the intended direction. Compared to the baseline, RC reduces the measured importance of the masked belief feature by more than 70% (gain-based importance), with no material degradation in overall offline ranking metrics. On an offline cold slice defined by items with historical impressions below a threshold $\tau$, RC yields a lift of more than 1% in offline utility metrics, motivating the online test.
#### Online experiment design and metrics\.
We run a controlled randomized A/B test over production traffic for approximately two weeks, comparing the RC policy to the baseline. We report *relative lifts* in (i) exposure metrics and (ii) outcome metrics, evaluated on stratified inventory segments designed to isolate cold and strategic supply. Exposure is measured as the number of times an item is shown in search results (aggregated over placements), while outcomes are measured via *sale velocity*: the fraction (or count) of listings that obtain at least one transaction within a fixed window of $Y$ days from listing/publish time. (Footnote 1: We use sale velocity because it is sensitive to marketplace coverage and newcomer success, and it captures a long-run objective beyond per-impression click/purchase rates.) We also measured aggregate marketplace KPIs (e.g., transactions and revenue), which remained neutral over the test window.
#### Results and Discussions
Tables [2](https://arxiv.org/html/2606.09891#S5.T2) and [3](https://arxiv.org/html/2606.09891#S5.T3) summarize the primary observed effects in the A/B test. The RC policy increases exposure for items with weak purchase history, particularly newly listed inventory, while preserving overall business outcomes.
Table 2. Relative lift in exposure under RC vs. baseline for cold inventory segments.
Table 3. Relative lift in sale velocity under RC vs. baseline on strategic inventory segments.
We make a few remarks to help interpret the magnitude and direction of the observed effects. First, the intervention masks only one strong ID-based belief feature; other historic belief signals, including uncertainty-aware historical estimates (e.g., empirical-Bayes style aggregates), remain available. This can attenuate shifts in the most extreme “no-impression” segments, but the intervention still meaningfully affects “no-purchase” segments, where belief features can otherwise dominate. Second, exposure is reported across all placements while the treatment changes only sponsored ranking; the reported lifts are therefore diluted and conservative. Third, we observe that exposure increases not only for cold items but also for items with very large historical transaction volume (above a threshold $Z$), alongside reduced exposure for moderately established items; qualitatively, RC appears to reduce reliance on a single strong shortcut and reallocates exposure away from a narrow band of historically advantaged inventory. Finally, while aggregate marketplace KPIs remain neutral, sale velocity improves, aligning with a mechanism-design objective in dynamic marketplaces: increase the probability that new and under-exposed listings earn early exposure and can demonstrate merit through realized transactions. Some segment-level outcome lifts are directionally positive but estimated with larger uncertainty due to smaller subpopulation mass; we therefore emphasize the robust pattern across exposure and sale-velocity metrics rather than over-interpreting any single fine-grained slice.
### 5\.4\.Reproducibility\.
Code, notebooks, and experiment configurations for MSLR\-WEB10K and MovieLens\-20M are available at[https://github\.com/sinaBaharlouei/CurriculumRepresentation](https://github.com/sinaBaharlouei/CurriculumRepresentation), including scripts to reproduce the reported metrics and figures\. Due to privacy and platform constraints, production data and code are not released; we report aggregated A/B results and provide sufficient implementation details to replicate RC on analogous marketplace data\.
## 6\.Conclusion
Exposure-dependent historic belief signals are powerful predictors in ranking and recommendation, but over-reliance on them can yield brittle generalization on strategically important sub-populations. We introduced Representation Curriculum (RC), a semantics-aware optimization-trajectory intervention that trains a content pathway first (masking belief signals) and then enables all signals while anchoring the learned content representation. Our linear analysis provides a closed-form characterization and a quantified Pareto tradeoff between performance on the logged source distribution and robustness in cold-start regimes. Empirically, we show across multiple settings that RC reduces reliance on historic features and improves cold-segment performance without material impact on overall performance.
Future work includes (i) combining feature curricula with standard instance-based curricula by choosing stage-specific training distributions, so that intent affinity/content-based merit is learned broadly before specializing with belief signals on (converting) contexts where belief signals are most informative; (ii) extending guarantees beyond frozen start; and (iii) advancing a mechanism-design view of *controllable* allocation policies by developing explicit knobs and diagnostics for policy properties, such as exposure elasticity to popularity, coverage targets, and amortized exposure constraints aligned with marketplace welfare.
## References
- H\. Abdollahpouri, R\. Burke, and B\. Mobasher \(2019\) Managing popularity bias in recommender systems with personalized re\-ranking\. In Proceedings of the International Florida Artificial Intelligence Research Society Conference \(FLAIRS\), pp\. 413–418\. Cited by: [§1](https://arxiv.org/html/2606.09891#S1.p2.1)\.
- R\. Ardywibowo, R\. Sunki, S\. T\. L\. Kuo, and S\. Nayak \(2025\) BAYESCNS: a unified bayesian approach to address cold start and non\-stationarity in search systems at scale\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 39, pp\. 12–20\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p3.1)\.
- M\. Arjovsky, L\. Bottou, I\. Gulrajani, and D\. Lopez\-Paz \(2019\) Invariant risk minimization\. arXiv preprint arXiv:1907\.02893\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p1.1)\.
- A\. J\. Chaney, B\. M\. Stewart, and B\. E\. Engelhardt \(2018\) How algorithmic confounding in recommendation systems increases homogeneity and decreases utility\. In Proceedings of the 12th ACM conference on recommender systems, pp\. 224–232\. Cited by: [§1](https://arxiv.org/html/2606.09891#S1.p1.1)\.
- J\. Chen, Y\. Dong, X\. Wang, F\. Feng, M\. Wang, and X\. He \(2023\) Bias and debias in recommender system: a survey and future directions\. ACM Transactions on Information Systems 41\(3\)\. External Links: [Document](https://dx.doi.org/10.1145/3564284)\. Cited by: [§1](https://arxiv.org/html/2606.09891#S1.p2.1)\.
- M\. Chen, C\. Xu, V\. Gatto, D\. Jain, A\. Kumar, and E\. Chi \(2022\) Off\-policy actor\-critic for recommender systems\. In Proceedings of the 16th ACM Conference on Recommender Systems, pp\. 338–349\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p2.1)\.
- E\. Ebrahimzadeh, A\. Cozzi, and A\. Bagherjeiran \(2024a\) Counterfactual learning to rank via knowledge distillation\. In Proceedings of the ACM SIGIR Workshop on eCommerce \(eCom@SIGIR\)\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p2.1)\.
- E\. Ebrahimzadeh, N\. Monga, H\. Gao, A\. Cozzi, and A\. Bagherjeiran \(2024b\) Ranking policy learning via marketplace expected value estimation from observational data\. Note: SURE at 18th ACM Conference on Recommender Systems \(RecSys\)\. Cited by: [§5\.3](https://arxiv.org/html/2606.09891#S5.SS3.p1.1)\.
- N\. Ferro, C\. Lucchese, M\. Maistro, and R\. Perego \(2018\) Continuation methods and curriculum learning for learning to rank\. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp\. 1523–1526\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p5.1)\.
- C\. Han, T\. Yang, Z\. Wang, A\. Tran, and Q\. Ai \(2022\) Addressing cold start in product search via empirical bayes\. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management \(CIKM ’22\)\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p3.1)\.
- G\. Hinton, O\. Vinyals, and J\. Dean \(2015\) Distilling the knowledge in a neural network\. Note: arXiv preprint arXiv:1503\.02531\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p2.1)\.
- T\. Joachims, A\. Swaminathan, and T\. Schnabel \(2017\) Unbiased learning\-to\-rank with biased feedback\. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining \(WSDM ’17\), pp\. 781–789\. External Links: [Document](https://dx.doi.org/10.1145/3018661.3018699)\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p1.1)\.
- J\. Peters, P\. Bühlmann, and N\. Meinshausen \(2016\) Causal inference by using invariant prediction: identification and confidence intervals\. Journal of the Royal Statistical Society Series B: Statistical Methodology 78\(5\), pp\. 947–1012\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p1.1)\.
- M\. Pezeshki, O\. Kaba, Y\. Bengio, A\. C\. Courville, D\. Precup, and G\. Lajoie \(2021\) Gradient starvation: a learning proclivity in neural networks\. Advances in Neural Information Processing Systems 34, pp\. 1256–1272\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p5.1), [§2\.1](https://arxiv.org/html/2606.09891#S2.SS1.p5.1)\.
- T\. Qin and T\. Liu \(2010\) LETOR 4\.0 datasets \(mslr\-web10k / mslr\-web30k\)\. Note: Microsoft Research dataset page\. Cited by: [§5\.1](https://arxiv.org/html/2606.09891#S5.SS1.SSS0.Px1.p1.4), [§5\.1](https://arxiv.org/html/2606.09891#S5.SS1.p1.1)\.
- T\. Qin and T\. Liu \(2013\) Introducing LETOR 4\.0 datasets\. Note: arXiv preprint arXiv:1306\.2597\. Cited by: [§5\.1](https://arxiv.org/html/2606.09891#S5.SS1.p1.1)\.
- N\. Reimers and I\. Gurevych \(2019\) Sentence\-bert: sentence embeddings using siamese bert\-networks\. arXiv preprint arXiv:1908\.10084\. Cited by: [§5\.2](https://arxiv.org/html/2606.09891#S5.SS2.SSS0.Px2.p1.1)\.
- D\. Rothenhäusler, N\. Meinshausen, P\. Bühlmann, and J\. Peters \(2021\) Anchor regression: heterogeneous data meets causality\. Journal of the Royal Statistical Society: Series B 83\(2\), pp\. 215–246\. External Links: [Document](https://dx.doi.org/10.1111/rssb.12398)\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p1.1)\.
- T\. Schnabel, A\. Swaminathan, A\. Singh, N\. Chandak, and T\. Joachims \(2016\) Recommendations as treatments: debiasing learning and evaluation\. In Proceedings of the 33rd International Conference on Machine Learning \(ICML\), pp\. 1670–1679\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p1.1)\.
- A\. Singh and T\. Joachims \(2018\) Fairness of exposure in rankings\. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining \(KDD ’18\), pp\. 2219–2228\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p2.1)\.
- A\. Singh and T\. Joachims \(2019\) Policy learning for fairness in ranking\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p2.1)\.
- A\. Singh, D\. Kempe, and T\. Joachims \(2021\) Fairness in ranking under uncertainty\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p2.1)\.
- N\. Srivastava, G\. Hinton, A\. Krizhevsky, I\. Sutskever, and R\. Salakhutdinov \(2014\) Dropout: a simple way to prevent neural networks from overfitting\. Journal of Machine Learning Research 15, pp\. 1929–1958\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p5.1), [§3](https://arxiv.org/html/2606.09891#S3.SS0.SSS0.Px4.p1.1)\.
- A\. Swaminathan and T\. Joachims \(2015\) Counterfactual risk minimization: learning from logged bandit feedback\. In Proceedings of the 32nd International Conference on Machine Learning \(ICML\), pp\. 814–823\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p1.1)\.
- J\. Tang, H\. Gao, L\. He, and S\. Katariya \(2024\) Multi\-objective learning to rank by model distillation\. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 5783–5792\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p2.1)\.
- M\. Volkovs, G\. Yu, and T\. Poutanen \(2017\) DropoutNet: addressing cold start in recommender systems\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p4.1), [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p5.1)\.
- C\. Wang, M\. Niepert, and H\. Li \(2018a\) LRMM: learning to recommend with missing modalities\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 3360–3370\. External Links: [Document](https://dx.doi.org/10.18653/v1/D18-1372)\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p4.1), [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p5.1)\.
- X\. Wang, C\. Li, N\. Golbandi, M\. Bendersky, and M\. Najork \(2018b\) The lambdaloss framework for ranking metric optimization\. In Proceedings of the 27th ACM international conference on information and knowledge management, pp\. 1313–1322\. Cited by: [§2\.1](https://arxiv.org/html/2606.09891#S2.SS1.p3.1)\.
- T\. Wei, F\. Feng, J\. Chen, Z\. Wu, J\. Yi, and X\. He \(2021\) Model\-agnostic counterfactual reasoning for eliminating popularity bias in recommender system\. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pp\. 1791–1800\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p1.1)\.
- T\. Yang, C\. Han, C\. Luo, P\. Gupta, J\. M\. Phillips, and Q\. Ai \(2024\) Mitigating exploitation bias in learning to rank with an uncertainty\-aware empirical bayes approach\. In Proceedings of the ACM Web Conference 2024 \(WWW ’24\), pp\. 1486–1496\. External Links: [Document](https://dx.doi.org/10.1145/3589334.3645487)\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p3.1)\.
- T\. Yang, C\. Luo, H\. Lu, P\. Gupta, B\. Yin, and Q\. Ai \(2022\) Can clicks be both labels and features?: unbiased behavior feature collection and uncertainty\-aware learning to rank\. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval \(SIGIR ’22\), pp\. 6–17\. External Links: [Document](https://dx.doi.org/10.1145/3477495.3531948)\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p3.1)\.
- H\. Zeng, H\. Zamani, and V\. Vinay \(2022\) Curriculum learning for dense retrieval distillation\. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp\. 1979–1983\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p5.1)\.
- Y\. Zhang, F\. Feng, X\. He, T\. Wei, C\. Song, G\. Ling, and Y\. Zhang \(2021\) Causal intervention for leveraging popularity bias in recommendation\. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, pp\. 11–20\. Cited by: [§1\.1](https://arxiv.org/html/2606.09891#S1.SS1.p1.1)\.
## Appendix AProofs for Section[4](https://arxiv.org/html/2606.09891#S4)
This appendix provides full proofs for Lemma [1](https://arxiv.org/html/2606.09891#S4.Thmtheorem1), Theorem [2](https://arxiv.org/html/2606.09891#S4.Thmtheorem2), and Corollary [3](https://arxiv.org/html/2606.09891#S4.Thmtheorem3)\. We follow the notation of Section [4](https://arxiv.org/html/2606.09891#S4): $x=(x_C,x_H)\in\mathbb{R}^{d_C+d_H}$, $y=x_C^{\top}\beta_C+x_H^{\top}\beta_H+\varepsilon$ with $\mathbb{E}[\varepsilon\mid x]=0$, and $\Sigma_{\mathcal{D}}=\mathbb{E}_{\mathcal{D}}[xx^{\top}]$\.
### A\.1\.Population Normal Equation
For any distribution $\mathcal{D}$ and any $w=(w_C,w_H)$, define the squared\-loss risk $R_{\mathcal{D}}(w)=\mathbb{E}_{\mathcal{D}}[(y-x_C^{\top}w_C-x_H^{\top}w_H)^2]$\. Using the model assumptions and $\mathbb{E}[x\varepsilon]=0$, we have

$$R_{\mathcal{D}}(w)=\sigma^{2}+(w-\beta)^{\top}\Sigma_{\mathcal{D}}(w-\beta). \tag{22}$$

In particular, the *excess risk* in Section [4](https://arxiv.org/html/2606.09891#S4) is $\mathcal{E}_{\mathcal{D}}(w)=(w-\beta)^{\top}\Sigma_{\mathcal{D}}(w-\beta)$\.
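The step from the risk definition to \(22\) can be spelled out as a short expansion\. This is a sketch under two assumptions that are not restated in this appendix: $\beta=(\beta_C,\beta_H)$ stacks the two coefficient blocks, and $\sigma^{2}$ denotes the noise second moment $\mathbb{E}_{\mathcal{D}}[\varepsilon^{2}]$\.

$$
\begin{aligned}
y-x^{\top}w &= \varepsilon-x^{\top}(w-\beta),\\
R_{\mathcal{D}}(w) &= \mathbb{E}_{\mathcal{D}}[\varepsilon^{2}]-2\,\mathbb{E}_{\mathcal{D}}[\varepsilon x]^{\top}(w-\beta)+(w-\beta)^{\top}\mathbb{E}_{\mathcal{D}}[xx^{\top}](w-\beta)\\
&= \sigma^{2}+(w-\beta)^{\top}\Sigma_{\mathcal{D}}(w-\beta).
\end{aligned}
$$

The cross term vanishes because $\mathbb{E}[x\varepsilon]=0$, so the only part of the risk that depends on $w$ is the quadratic form in $w-\beta$\.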
#### Full ridge under $\mathcal{P}$\.
The objective in \([8](https://arxiv.org/html/2606.09891#S4.E8)\) is $R_{\mathcal{P}}(w)+\lambda\|w\|_2^2$\. Differentiating and setting the gradient to zero yields the population normal equations

$$(\Sigma_{\mathcal{P}}+\lambda I)\,w^{\textsc{Full}}\;=\;\mathbb{E}_{\mathcal{P}}[xy]\;=\;\mathbb{E}_{\mathcal{P}}[xx^{\top}]\beta\;=\;\Sigma_{\mathcal{P}}\beta. \tag{23}$$

#### Content\-only ridge \(Stage 1\) under $\mathcal{P}$\.
Similarly, differentiating \([9](https://arxiv.org/html/2606.09891#S4.E9)\) gives

$$(\Sigma_{\mathcal{P},CC}+\lambda I)\,w_C^{(1)}\;=\;\mathbb{E}_{\mathcal{P}}[x_Cy]\;=\;\Sigma_{\mathcal{P},CC}\beta_C+\Sigma_{\mathcal{P},CH}\beta_H. \tag{24}$$

#### Anchored ridge \(Stage 2\) under $\mathcal{P}$\.
The objective in \([10](https://arxiv.org/html/2606.09891#S4.E10)\) is $R_{\mathcal{P}}(w)+\lambda(\|w_C\|^2+\|w_H\|^2)+\mu\|w_C-w_C^{(1)}\|^2$\. Differentiating and setting the gradient to zero yields

$$(\Sigma_{\mathcal{P}}+\lambda I)\,w^{\textsc{RC}}(\mu)+\mu\begin{bmatrix}w_C^{\textsc{RC}}(\mu)\\ 0\end{bmatrix}\;=\;\Sigma_{\mathcal{P}}\beta+\mu\begin{bmatrix}w_C^{(1)}\\ 0\end{bmatrix}. \tag{25}$$
### A\.2\.Proof of Lemma[1](https://arxiv.org/html/2606.09891#S4.Thmtheorem1)\(interpolation\)
###### Proof of Lemma[1](https://arxiv.org/html/2606.09891#S4.Thmtheorem1)\.
Write the \(regularized\) block covariances under $\mathcal{P}$ as in Section [4\.1](https://arxiv.org/html/2606.09891#S4.SS1):

$$A_{CC}=\Sigma_{\mathcal{P},CC}+\lambda I,\quad A_{HH}=\Sigma_{\mathcal{P},HH}+\lambda I,\quad A_{CH}=\Sigma_{\mathcal{P},CH},\quad A_{HC}=A_{CH}^{\top}.$$

Let $b=\mathbb{E}_{\mathcal{P}}[xy]=\Sigma_{\mathcal{P}}\beta$ and write its blocks as $b_C=\mathbb{E}_{\mathcal{P}}[x_Cy]$ and $b_H=\mathbb{E}_{\mathcal{P}}[x_Hy]$\. Then \([23](https://arxiv.org/html/2606.09891#A1.E23)\) is equivalent to the block system

$$\begin{aligned}A_{CC}\,w_C^{\textsc{Full}}+A_{CH}\,w_H^{\textsc{Full}}&=b_C,\\ A_{HC}\,w_C^{\textsc{Full}}+A_{HH}\,w_H^{\textsc{Full}}&=b_H.\end{aligned} \tag{26}$$

Since $A_{HH}\succ 0$, we can eliminate $w_H^{\textsc{Full}}=A_{HH}^{-1}(b_H-A_{HC}w_C^{\textsc{Full}})$ and obtain

$$A_0\,w_C^{\textsc{Full}}\;=\;b_C-A_{CH}A_{HH}^{-1}b_H, \tag{27}$$

where $A_0=A_{CC}-A_{CH}A_{HH}^{-1}A_{HC}$ is the Schur complement \([11](https://arxiv.org/html/2606.09891#S4.E11)\)\.
Next, the anchored system \([25](https://arxiv.org/html/2606.09891#A1.E25)\) can be written in blocks as

$$\begin{aligned}(A_{CC}+\mu I)\,w_C^{\textsc{RC}}(\mu)+A_{CH}\,w_H^{\textsc{RC}}(\mu)&=b_C+\mu w_C^{(1)},\\ A_{HC}\,w_C^{\textsc{RC}}(\mu)+A_{HH}\,w_H^{\textsc{RC}}(\mu)&=b_H.\end{aligned} \tag{28}$$

Eliminating $w_H^{\textsc{RC}}(\mu)=A_{HH}^{-1}(b_H-A_{HC}w_C^{\textsc{RC}}(\mu))$ gives

$$(A_0+\mu I)\,w_C^{\textsc{RC}}(\mu)\;=\;b_C+\mu w_C^{(1)}-A_{CH}A_{HH}^{-1}b_H. \tag{29}$$

Comparing \([29](https://arxiv.org/html/2606.09891#A1.E29)\) with \([27](https://arxiv.org/html/2606.09891#A1.E27)\), we have $b_C-A_{CH}A_{HH}^{-1}b_H=A_0w_C^{\textsc{Full}}$, hence

$$(A_0+\mu I)\,w_C^{\textsc{RC}}(\mu)=A_0w_C^{\textsc{Full}}+\mu w_C^{(1)}. \tag{30}$$

Multiplying by $(A_0+\mu I)^{-1}$ yields

$$w_C^{\textsc{RC}}(\mu)=(A_0+\mu I)^{-1}A_0\,w_C^{\textsc{Full}}+(A_0+\mu I)^{-1}\mu w_C^{(1)}.$$

Using $(A_0+\mu I)^{-1}A_0=I-\mu(A_0+\mu I)^{-1}$, we obtain the interpolation identity

$$w_C^{\textsc{RC}}(\mu)=w_C^{\textsc{Full}}+\mu(A_0+\mu I)^{-1}\big(w_C^{(1)}-w_C^{\textsc{Full}}\big),$$

which is \([12](https://arxiv.org/html/2606.09891#S4.E12)\) with $W_\mu=\mu(A_0+\mu I)^{-1}$\.
For the history block, from \([28](https://arxiv.org/html/2606.09891#A1.E28)\) and \([26](https://arxiv.org/html/2606.09891#A1.E26)\) we have

$$w_H^{\textsc{RC}}(\mu)=A_{HH}^{-1}\big(b_H-A_{HC}w_C^{\textsc{RC}}(\mu)\big),\qquad w_H^{\textsc{Full}}=A_{HH}^{-1}\big(b_H-A_{HC}w_C^{\textsc{Full}}\big),$$

and subtracting gives \([13](https://arxiv.org/html/2606.09891#S4.E13)\):

$$w_H^{\textsc{RC}}(\mu)-w_H^{\textsc{Full}}=-A_{HH}^{-1}A_{HC}\big(w_C^{\textsc{RC}}(\mu)-w_C^{\textsc{Full}}\big).$$
Finally, since $A_0$ is symmetric with eigenvalues $\{a_i\}$, $W_\mu$ is a matrix function of $A_0$: $W_\mu=f_\mu(A_0)$ with $f_\mu(a)=\mu/(a+\mu)$\. Therefore $W_\mu$ has eigenvalues $f_\mu(a_i)\in[0,1)$\. Its operator norm is $\|W_\mu\|_{\mathrm{op}}=\max_i\mu/(a_i+\mu)=\mu/(a_{\min}+\mu)$, which implies, with $\Delta_C=w_C^{(1)}-w_C^{\textsc{Full}}$,

$$\|w_C^{\textsc{RC}}(\mu)-w_C^{\textsc{Full}}\|=\|W_\mu\Delta_C\|\leq\|W_\mu\|_{\mathrm{op}}\|\Delta_C\|\leq\frac{\mu}{a_{\min}+\mu}\|\Delta_C\|.$$

Moreover, $w_C^{\textsc{RC}}(\mu)-w_C^{(1)}=(W_\mu-I)\Delta_C=-A_0(A_0+\mu I)^{-1}\Delta_C$ and $\|A_0(A_0+\mu I)^{-1}\|_{\mathrm{op}}=\max_ia_i/(a_i+\mu)=a_{\max}/(a_{\max}+\mu)$, giving the second inequality in \([20](https://arxiv.org/html/2606.09891#S4.E20)\)\. ∎
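As a numerical sanity check of the interpolation identity and the history relation above, the sketch below works in the simplest case of scalar blocks \($d_C=d_H=1$\), where every matrix in the proof reduces to a number\. The function names `solve_2x2` and `interpolation_residuals` and all example values are illustrative assumptions, not part of the paper's released code\.

```python
def solve_2x2(a11, a12, a21, a22, r1, r2):
    """Solve [[a11, a12], [a21, a22]] @ [x1, x2] = [r1, r2] by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return (r1 * a22 - a12 * r2) / det, (a11 * r2 - a21 * r1) / det


def interpolation_residuals(s_cc, s_ch, s_hh, beta_c, beta_h, lam, mu):
    """Residuals of the content and history identities for scalar blocks.

    (s_cc, s_ch, s_hh) are the entries of the 2x2 source covariance;
    this scalar setting is an assumption made only for illustration.
    """
    a_cc, a_hh, a_ch = s_cc + lam, s_hh + lam, s_ch  # regularized blocks
    b_c = s_cc * beta_c + s_ch * beta_h              # E[x_C y]
    b_h = s_ch * beta_c + s_hh * beta_h              # E[x_H y]

    # Full ridge: block system (26).
    full_c, full_h = solve_2x2(a_cc, a_ch, a_ch, a_hh, b_c, b_h)
    # Stage 1 content-only ridge: equation (24).
    stage1_c = b_c / a_cc
    # Stage 2 anchored ridge: block system (28).
    rc_c, rc_h = solve_2x2(a_cc + mu, a_ch, a_ch, a_hh,
                           b_c + mu * stage1_c, b_h)

    a0 = a_cc - a_ch * a_ch / a_hh                   # Schur complement A_0
    w_mu = mu / (a0 + mu)                            # scalar W_mu
    # Content identity: w_C^RC = w_C^Full + W_mu (w_C^(1) - w_C^Full).
    content_res = rc_c - (full_c + w_mu * (stage1_c - full_c))
    # History identity: w_H^RC - w_H^Full = -(A_HC / A_HH)(w_C^RC - w_C^Full).
    history_res = (rc_h - full_h) + (a_ch / a_hh) * (rc_c - full_c)
    return content_res, history_res, w_mu
```

Calling `interpolation_residuals(1.0, 0.6, 2.0, 1.0, 0.5, 0.1, mu)` for several values of `mu` should return two residuals that vanish up to floating\-point rounding, together with a weight `w_mu` that moves from 0 toward 1 as `mu` grows\.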
### A\.3\.An Auxiliary Inequality: Excess Risk vs\. Model Disagreement
Theorem [2](https://arxiv.org/html/2606.09891#S4.Thmtheorem2) controls how far $w^{\textsc{RC}}(\mu)$ moves from $w^{\textsc{Full}}$ \(on $\mathcal{P}$\) and from $w^{\textsc{Content}}$ \(on $\mathcal{Q}$\)\. A convenient quantity for this is the *prediction disagreement*

$$\mathcal{D}_{\mathcal{D}}(u,v)\;:=\;\mathbb{E}_{\mathcal{D}}\!\left[(x^{\top}(u-v))^{2}\right]\;=\;(u-v)^{\top}\Sigma_{\mathcal{D}}(u-v), \tag{31}$$

which is exactly the expected squared change in predictions between $u$ and $v$ under $\mathcal{D}$\.
###### Lemma 0 \(From disagreement to excess\-risk regret\)\.
For any distribution $\mathcal{D}$ and any $u,v\in\mathbb{R}^{d_C+d_H}$,

$$\big|\sqrt{\mathcal{E}_{\mathcal{D}}(u)}-\sqrt{\mathcal{E}_{\mathcal{D}}(v)}\big|\;\leq\;\sqrt{\mathcal{D}_{\mathcal{D}}(u,v)}. \tag{32}$$

Consequently,

$$\mathcal{E}_{\mathcal{D}}(u)\;\leq\;\Big(\sqrt{\mathcal{E}_{\mathcal{D}}(v)}+\sqrt{\mathcal{D}_{\mathcal{D}}(u,v)}\Big)^{2}, \tag{33}$$

and symmetrically with $u$ and $v$ swapped\. In particular,

$$\big|\mathcal{E}_{\mathcal{D}}(u)-\mathcal{E}_{\mathcal{D}}(v)\big|\;\leq\;2\sqrt{\mathcal{E}_{\mathcal{D}}(v)}\,\sqrt{\mathcal{D}_{\mathcal{D}}(u,v)}+\mathcal{D}_{\mathcal{D}}(u,v). \tag{34}$$
###### Proof\.
Let $S_{\mathcal{D}}=\Sigma_{\mathcal{D}}^{1/2}$ be the symmetric square root\. Then $\sqrt{\mathcal{E}_{\mathcal{D}}(u)}=\|S_{\mathcal{D}}(u-\beta)\|_2$ and $\sqrt{\mathcal{D}_{\mathcal{D}}(u,v)}=\|S_{\mathcal{D}}(u-v)\|_2$\. By the reverse triangle inequality,

$$\big|\|S_{\mathcal{D}}(u-\beta)\|_2-\|S_{\mathcal{D}}(v-\beta)\|_2\big|\leq\|S_{\mathcal{D}}(u-v)\|_2,$$

which proves \([32](https://arxiv.org/html/2606.09891#A1.E32)\)\. Inequality \([33](https://arxiv.org/html/2606.09891#A1.E33)\) follows from the \(forward\) triangle inequality $\|S_{\mathcal{D}}(u-\beta)\|\leq\|S_{\mathcal{D}}(v-\beta)\|+\|S_{\mathcal{D}}(u-v)\|$ by squaring both sides\. Finally, for \([34](https://arxiv.org/html/2606.09891#A1.E34)\), write $\big|\mathcal{E}_{\mathcal{D}}(u)-\mathcal{E}_{\mathcal{D}}(v)\big|=\big|\sqrt{\mathcal{E}_{\mathcal{D}}(u)}-\sqrt{\mathcal{E}_{\mathcal{D}}(v)}\big|\,\big(\sqrt{\mathcal{E}_{\mathcal{D}}(u)}+\sqrt{\mathcal{E}_{\mathcal{D}}(v)}\big)$\. The first factor is at most $\sqrt{\mathcal{D}_{\mathcal{D}}(u,v)}$ by \([32](https://arxiv.org/html/2606.09891#A1.E32)\), and the second is at most $2\sqrt{\mathcal{E}_{\mathcal{D}}(v)}+\sqrt{\mathcal{D}_{\mathcal{D}}(u,v)}$ by the triangle inequality above; multiplying the two bounds gives \([34](https://arxiv.org/html/2606.09891#A1.E34)\)\. ∎
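The two inequalities \(32\) and \(34\) are easy to probe numerically\. The sketch below is a minimal illustration that assumes a diagonal covariance for simplicity \(the lemma itself holds for any covariance\); the helper names and the random test setup are hypothetical and chosen only for this example\.

```python
import math
import random


def quad_form(diag_cov, vec):
    """Return vec^T diag(diag_cov) vec for a diagonal covariance (an assumption)."""
    return sum(s * z * z for s, z in zip(diag_cov, vec))


def excess_risk(diag_cov, w, beta):
    """Excess risk (w - beta)^T Sigma (w - beta)."""
    return quad_form(diag_cov, [wi - bi for wi, bi in zip(w, beta)])


def disagreement(diag_cov, u, v):
    """Prediction disagreement (u - v)^T Sigma (u - v), as in (31)."""
    return quad_form(diag_cov, [ui - vi for ui, vi in zip(u, v)])


def lemma_slacks(diag_cov, u, v, beta):
    """Return right-hand side minus left-hand side for (32) and (34)."""
    e_u = excess_risk(diag_cov, u, beta)
    e_v = excess_risk(diag_cov, v, beta)
    d_uv = disagreement(diag_cov, u, v)
    slack_32 = math.sqrt(d_uv) - abs(math.sqrt(e_u) - math.sqrt(e_v))
    slack_34 = 2.0 * math.sqrt(e_v) * math.sqrt(d_uv) + d_uv - abs(e_u - e_v)
    return slack_32, slack_34


def worst_slacks(num_trials=1000, dim=4, seed=0):
    """Smallest slacks over random draws; nonnegative up to rounding if the lemma holds."""
    rng = random.Random(seed)
    worst_32 = worst_34 = math.inf
    for _ in range(num_trials):
        diag_cov = [rng.uniform(0.0, 3.0) for _ in range(dim)]
        u, v, beta = ([rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3))
        s32, s34 = lemma_slacks(diag_cov, u, v, beta)
        worst_32, worst_34 = min(worst_32, s32), min(worst_34, s34)
    return worst_32, worst_34
```

A call such as `worst_slacks()` reports the tightest observed slack for each inequality; values that stay nonnegative \(up to rounding\) are consistent with the lemma, although a random search is of course not a proof\.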
### A\.4\.Proof of Theorem[2](https://arxiv.org/html/2606.09891#S4.Thmtheorem2)
###### Proof of Theorem[2](https://arxiv.org/html/2606.09891#S4.Thmtheorem2)\.
We prove each item in turn, using Lemma[1](https://arxiv.org/html/2606.09891#S4.Thmtheorem1)and the disagreement quantity \([31](https://arxiv.org/html/2606.09891#A1.E31)\)\.
#### Step 1: A bound on the content displacement\.
Lemma [1](https://arxiv.org/html/2606.09891#S4.Thmtheorem1) gives $w_C^{\textsc{RC}}(\mu)-w_C^{\textsc{Full}}=W_\mu\Delta_C$ with $\|W_\mu\|_{\mathrm{op}}\leq\mu/(a_{\min}+\mu)$\. Hence

$$\|w_C^{\textsc{RC}}(\mu)-w_C^{\textsc{Full}}\|\leq\frac{\mu}{a_{\min}+\mu}\|\Delta_C\|. \tag{35}$$
#### Step 2: History responds linearly to content changes\.
Lemma [1](https://arxiv.org/html/2606.09891#S4.Thmtheorem1) also gives $w_H^{\textsc{RC}}(\mu)-w_H^{\textsc{Full}}=-A_{HH}^{-1}A_{HC}(w_C^{\textsc{RC}}(\mu)-w_C^{\textsc{Full}})$\. Let $B:=A_{HH}^{-1}A_{HC}$\. Then the full parameter displacement $\delta(\mu):=w^{\textsc{RC}}(\mu)-w^{\textsc{Full}}$ satisfies

$$\delta(\mu)=\begin{bmatrix}\delta_C(\mu)\\ \delta_H(\mu)\end{bmatrix}=\begin{bmatrix}I\\ -B\end{bmatrix}\delta_C(\mu),\qquad\delta_C(\mu)=w_C^{\textsc{RC}}(\mu)-w_C^{\textsc{Full}}.$$

Therefore,

$$\|\delta(\mu)\|_2^2=\|\delta_C(\mu)\|_2^2+\|B\delta_C(\mu)\|_2^2\leq\big(1+\|B\|_{\mathrm{op}}^2\big)\,\|\delta_C(\mu)\|_2^2=\kappa_H\,\|\delta_C(\mu)\|_2^2, \tag{36}$$

where $\kappa_H=1+\|A_{HH}^{-1}A_{HC}\|_{\mathrm{op}}^2$\.
#### \(A\) Bounded regret to Full on the source $\mathcal{P}$\.
By definition,

$$\mathcal{D}_{\mathcal{P}}(w^{\textsc{RC}}(\mu),w^{\textsc{Full}})=\delta(\mu)^{\top}\Sigma_{\mathcal{P}}\,\delta(\mu)\leq\lambda_{\max}(\Sigma_{\mathcal{P}})\,\|\delta(\mu)\|_2^2=L_{\mathcal{P}}\,\|\delta(\mu)\|_2^2.$$

Combining with \([36](https://arxiv.org/html/2606.09891#A1.E36)\) and \([35](https://arxiv.org/html/2606.09891#A1.E35)\) yields

$$\mathcal{D}_{\mathcal{P}}(w^{\textsc{RC}}(\mu),w^{\textsc{Full}})\leq L_{\mathcal{P}}\,\kappa_H\left(\frac{\mu}{a_{\min}+\mu}\right)^{2}\|\Delta_C\|^{2}. \tag{37}$$

To translate this into an excess\-risk regret statement, apply Lemma [1](https://arxiv.org/html/2606.09891#A1.Thmtheorem1) with $(u,v)=(w^{\textsc{RC}}(\mu),w^{\textsc{Full}})$:

$$\big|\mathcal{E}_{\mathcal{P}}(w^{\textsc{RC}}(\mu))-\mathcal{E}_{\mathcal{P}}(w^{\textsc{Full}})\big|\leq 2\sqrt{\mathcal{E}_{\mathcal{P}}(w^{\textsc{Full}})}\,\sqrt{\mathcal{D}_{\mathcal{P}}(w^{\textsc{RC}}(\mu),w^{\textsc{Full}})}+\mathcal{D}_{\mathcal{P}}(w^{\textsc{RC}}(\mu),w^{\textsc{Full}}),$$

and substituting \([37](https://arxiv.org/html/2606.09891#A1.E37)\) yields an explicit bound controlled by $\mu$\.
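For concreteness, the substitution mentioned above can be written out\. The shorthand $\rho(\mu)$ is introduced here only for readability and is not notation from the paper\.

$$
\rho(\mu):=\frac{\mu}{a_{\min}+\mu},\qquad
\big|\mathcal{E}_{\mathcal{P}}(w^{\textsc{RC}}(\mu))-\mathcal{E}_{\mathcal{P}}(w^{\textsc{Full}})\big|
\;\leq\;2\sqrt{\mathcal{E}_{\mathcal{P}}(w^{\textsc{Full}})\,L_{\mathcal{P}}\,\kappa_H}\;\rho(\mu)\,\|\Delta_C\|
+L_{\mathcal{P}}\,\kappa_H\,\rho(\mu)^{2}\,\|\Delta_C\|^{2}.
$$

Since $\rho(0)=0$ and $\rho$ is increasing in $\mu$ with $\rho(\mu)<1$, the right\-hand side vanishes as $\mu\to 0$ and grows monotonically with the anchor strength\.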
#### \(B\) Bounded regret to Content on the target $\mathcal{Q}$\.
Under the cold target \([15](https://arxiv.org/html/2606.09891#S4.E15)\), $x_H\equiv 0$ a\.s\., hence only the content block contributes to prediction disagreement and excess risk\. Let $\tilde{\delta}_C(\mu):=w_C^{\textsc{RC}}(\mu)-w_C^{(1)}$\. From Lemma [1](https://arxiv.org/html/2606.09891#S4.Thmtheorem1) we have $\tilde{\delta}_C(\mu)=-(I-W_\mu)\Delta_C=-A_0(A_0+\mu I)^{-1}\Delta_C$ and

$$\|\tilde{\delta}_C(\mu)\|\leq\frac{a_{\max}}{a_{\max}+\mu}\|\Delta_C\|. \tag{38}$$

Therefore, with $\Sigma_{\mathcal{Q},CC}$ the content covariance under $\mathcal{Q}$,

$$\mathcal{D}_{\mathcal{Q}}(w^{\textsc{RC}}(\mu),w^{\textsc{Content}})=\tilde{\delta}_C(\mu)^{\top}\Sigma_{\mathcal{Q},CC}\,\tilde{\delta}_C(\mu)\leq\lambda_{\max}(\Sigma_{\mathcal{Q},CC})\,\|\tilde{\delta}_C(\mu)\|^{2}=L_{\mathcal{Q}}\,\|\tilde{\delta}_C(\mu)\|^{2},$$

and by \([38](https://arxiv.org/html/2606.09891#A1.E38)\) we obtain

$$\mathcal{D}_{\mathcal{Q}}(w^{\textsc{RC}}(\mu),w^{\textsc{Content}})\leq L_{\mathcal{Q}}\left(\frac{a_{\max}}{a_{\max}+\mu}\right)^{2}\|\Delta_C\|^{2}. \tag{39}$$

Applying Lemma [1](https://arxiv.org/html/2606.09891#A1.Thmtheorem1) with $(u,v)=(w^{\textsc{RC}}(\mu),w^{\textsc{Content}})$ yields an excess\-risk regret bound of the same form \(up to the additional factor involving $\sqrt{\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Content}})}$\)\.
#### \(C\) Improvement over Full on $\mathcal{Q}$ for large enough $\mu$\.
Assume $m_{\mathcal{Q}}:=\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Full}})-\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Content}})>0$\. A sufficient condition for $\mathcal{E}_{\mathcal{Q}}(w^{\textsc{RC}}(\mu))\leq\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Full}})$ is

$$\mathcal{E}_{\mathcal{Q}}(w^{\textsc{RC}}(\mu))-\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Content}})\leq m_{\mathcal{Q}}.$$

By Lemma [1](https://arxiv.org/html/2606.09891#A1.Thmtheorem1) with $v=w^{\textsc{Content}}$, we have

$$\mathcal{E}_{\mathcal{Q}}(w^{\textsc{RC}}(\mu))-\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Content}})\leq 2\sqrt{\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Content}})\,\mathcal{D}_{\mathcal{Q}}(w^{\textsc{RC}}(\mu),w^{\textsc{Content}})}+\mathcal{D}_{\mathcal{Q}}(w^{\textsc{RC}}(\mu),w^{\textsc{Content}}).$$

Let $s(\mu):=\sqrt{\mathcal{D}_{\mathcal{Q}}(w^{\textsc{RC}}(\mu),w^{\textsc{Content}})}$\. The right\-hand side is $s(\mu)^{2}+2\sqrt{\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Content}})}\,s(\mu)$, which is at most $m_{\mathcal{Q}}$ whenever

$$s(\mu)\leq\sqrt{\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Full}})}-\sqrt{\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Content}})}. \tag{40}$$

Using the disagreement bound \([39](https://arxiv.org/html/2606.09891#A1.E39)\), condition \([40](https://arxiv.org/html/2606.09891#A1.E40)\) holds if

$$\sqrt{L_{\mathcal{Q}}}\,\frac{a_{\max}}{a_{\max}+\mu}\,\|\Delta_C\|\;\leq\;\sqrt{\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Full}})}-\sqrt{\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Content}})}.$$

Rearranging yields an explicit lower bound on $\mu$\. In the cold\-start regime where $\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Content}})$ is small, the right\-hand side is close to $\sqrt{m_{\mathcal{Q}}}$, recovering the simpler \(but more conservative\) condition in \([18](https://arxiv.org/html/2606.09891#S4.E18)\) up to constant factors\.
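The rearrangement in part \(C\) can be made explicit as follows\. The shorthands $c_{\mathcal{Q}}$ and $g_{\mathcal{Q}}$ are introduced here for readability only and are not notation from the paper; $g_{\mathcal{Q}}>0$ follows from $m_{\mathcal{Q}}>0$\.

$$
c_{\mathcal{Q}}:=\sqrt{L_{\mathcal{Q}}}\,\|\Delta_C\|,\qquad
g_{\mathcal{Q}}:=\sqrt{\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Full}})}-\sqrt{\mathcal{E}_{\mathcal{Q}}(w^{\textsc{Content}})},\qquad
c_{\mathcal{Q}}\,\frac{a_{\max}}{a_{\max}+\mu}\leq g_{\mathcal{Q}}
\;\Longleftrightarrow\;
\mu\;\geq\;a_{\max}\Big(\frac{c_{\mathcal{Q}}}{g_{\mathcal{Q}}}-1\Big).
$$

When $c_{\mathcal{Q}}\leq g_{\mathcal{Q}}$ the lower bound is nonpositive, so every $\mu\geq 0$ satisfies the sufficient condition\.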
#### \(D\) Improvement overContenton𝒫\\mathcal\{P\}for small enoughμ\\mu\.
Assume $m_{\mathcal{P}}:=\mathcal{E}_{\mathcal{P}}(w^{\textsc{Content}})-\mathcal{E}_{\mathcal{P}}(w^{\textsc{Full}})>0$. A sufficient condition for $\mathcal{E}_{\mathcal{P}}(w^{\textsc{RC}}(\mu))\leq\mathcal{E}_{\mathcal{P}}(w^{\textsc{Content}})$ is

$$\sqrt{\mathcal{E}_{\mathcal{P}}(w^{\textsc{RC}}(\mu))}\leq\sqrt{\mathcal{E}_{\mathcal{P}}(w^{\textsc{Content}})}.$$

By Lemma [1](https://arxiv.org/html/2606.09891#A1.Thmtheorem1) with $v=w^{\textsc{Full}}$,

$$\sqrt{\mathcal{E}_{\mathcal{P}}(w^{\textsc{RC}}(\mu))}\leq\sqrt{\mathcal{E}_{\mathcal{P}}(w^{\textsc{Full}})}+\sqrt{\mathcal{D}_{\mathcal{P}}(w^{\textsc{RC}}(\mu),w^{\textsc{Full}})}.$$

Hence it suffices that

$$\sqrt{\mathcal{D}_{\mathcal{P}}(w^{\textsc{RC}}(\mu),w^{\textsc{Full}})}\leq\sqrt{\mathcal{E}_{\mathcal{P}}(w^{\textsc{Content}})}-\sqrt{\mathcal{E}_{\mathcal{P}}(w^{\textsc{Full}})}.\tag{41}$$

Using ([37](https://arxiv.org/html/2606.09891#A1.E37)), condition ([41](https://arxiv.org/html/2606.09891#A1.E41)) holds if

$$\sqrt{L_{\mathcal{P}}\kappa_{H}}\,\frac{\mu}{a_{\min}+\mu}\,\|\Delta_{C}\|\;\leq\;\sqrt{\mathcal{E}_{\mathcal{P}}(w^{\textsc{Content}})}-\sqrt{\mathcal{E}_{\mathcal{P}}(w^{\textsc{Full}})}.$$

Rearranging yields an explicit upper bound on $\mu$.
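For reference, the upper bound can be made explicit as follows. This is a sketch assuming $a_{\min}+\mu>0$; the shorthands $r_{\mathcal{P}}$ and $c_{\mathcal{P}}$ are introduced here for illustration only, and $r_{\mathcal{P}}>0$ follows from $m_{\mathcal{P}}>0$.

$$r_{\mathcal{P}}:=\sqrt{\mathcal{E}_{\mathcal{P}}(w^{\textsc{Content}})}-\sqrt{\mathcal{E}_{\mathcal{P}}(w^{\textsc{Full}})},\qquad c_{\mathcal{P}}:=\sqrt{L_{\mathcal{P}}\kappa_{H}}\,\|\Delta_{C}\|,$$

$$c_{\mathcal{P}}\,\frac{\mu}{a_{\min}+\mu}\leq r_{\mathcal{P}}\;\Longleftrightarrow\;\mu\,(c_{\mathcal{P}}-r_{\mathcal{P}})\leq a_{\min}\,r_{\mathcal{P}}.$$

If $c_{\mathcal{P}}\leq r_{\mathcal{P}}$, the condition holds for every $\mu\geq 0$; otherwise it is equivalent to $\mu\leq a_{\min}\,r_{\mathcal{P}}/(c_{\mathcal{P}}-r_{\mathcal{P}})$.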
This completes the proof. ∎
### A.5. Proof of Corollary [3](https://arxiv.org/html/2606.09891#S4.Thmtheorem3)
###### Proof of Corollary [3](https://arxiv.org/html/2606.09891#S4.Thmtheorem3).
The displayed displacement bounds are exactly ([14](https://arxiv.org/html/2606.09891#S4.E14)) in Lemma [1](https://arxiv.org/html/2606.09891#S4.Thmtheorem1). For the Pareto-curve statement, combine the disagreement bounds ([37](https://arxiv.org/html/2606.09891#A1.E37)) and ([39](https://arxiv.org/html/2606.09891#A1.E39)): the source disagreement increases with $\mu$ as $\big(\frac{\mu}{a_{\min}+\mu}\big)^{2}$, while the target disagreement decreases as $\big(\frac{a_{\max}}{a_{\max}+\mu}\big)^{2}$. The existence of a feasible $\mu$ interval that satisfies both improvement conditions follows from parts (C) and (D) of Theorem [2](https://arxiv.org/html/2606.09891#S4.Thmtheorem2) whenever the corresponding margins are positive. ∎
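To make the Pareto-curve statement concrete, the following minimal sketch evaluates the two $\mu$-dependent factors and the $\mu$ range implied by the sufficient conditions in parts (C) and (D). All function and argument names are hypothetical; the constants ($a_{\min}$, $a_{\max}$, $L_{\mathcal{Q}}$, $L_{\mathcal{P}}$, $\kappa_{H}$, $\|\Delta_{C}\|$) are taken as given scalars, and `r_Q`, `r_P` denote the square-root risk gaps on the right-hand sides of the two conditions, assumed strictly positive.

```python
import math


def source_factor(mu: float, a_min: float) -> float:
    """Factor (mu / (a_min + mu))^2 governing the source disagreement bound."""
    return (mu / (a_min + mu)) ** 2


def target_factor(mu: float, a_max: float) -> float:
    """Factor (a_max / (a_max + mu))^2 governing the target disagreement bound."""
    return (a_max / (a_max + mu)) ** 2


def mu_interval(a_min, a_max, delta_norm, L_Q, L_P, kappa_H, r_Q, r_P):
    """Return (mu_lo, mu_hi) implied by the sufficient conditions (C) and (D).

    r_Q = sqrt(E_Q(w_Full)) - sqrt(E_Q(w_Content))   (assumed > 0)
    r_P = sqrt(E_P(w_Content)) - sqrt(E_P(w_Full))   (assumed > 0)
    """
    if min(a_min, a_max, delta_norm, r_Q, r_P) <= 0:
        raise ValueError("a_min, a_max, delta_norm, r_Q and r_P must be positive")
    c_Q = math.sqrt(L_Q) * delta_norm
    c_P = math.sqrt(L_P * kappa_H) * delta_norm
    # (C): c_Q * a_max / (a_max + mu) <= r_Q  <=>  mu >= a_max * (c_Q / r_Q - 1)
    mu_lo = max(0.0, a_max * (c_Q / r_Q - 1.0))
    # (D): c_P * mu / (a_min + mu) <= r_P  <=>  mu <= a_min * r_P / (c_P - r_P)
    mu_hi = math.inf if r_P >= c_P else a_min * r_P / (c_P - r_P)
    return mu_lo, mu_hi


if __name__ == "__main__":
    lo, hi = mu_interval(a_min=1.0, a_max=2.0, delta_norm=1.0,
                         L_Q=1.0, L_P=1.0, kappa_H=1.0, r_Q=0.5, r_P=0.75)
    print("mu range from (C) and (D):", lo, hi)
    for mu in (0.5, 1.0, 2.0, 4.0):
        print(mu, source_factor(mu, 1.0), target_factor(mu, 2.0))
```

Returning the pair rather than a boolean lets the caller inspect how wide the range is for a given set of constants; the illustrative values in the usage example are arbitrary and not taken from the paper.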
## Appendix B GBDT Risk Bound
We now state a conservative guarantee that parallels the “controlled refinement” interpretation of Theorem [2](https://arxiv.org/html/2606.09891#S4.Thmtheorem2). Because boosting is nonconvex in tree structure, we do not attempt to characterize the *optimal* joint model. Instead, we show that RC *inherits* the anchor’s target performance up to a correction term that is explicitly controlled by Stage 2 capacity. This already yields a verifiable sufficient condition for improving over a belief-dominant baseline on the target population.
###### Proposition 0 (Base-margin RC: target-risk stability under bounded correction).
Consider squared loss and a target population $\mathcal{Q}$ over merit features $x_{\mathrm{cont}}$. Let $F_{\mathrm{anc}}(x_{\mathrm{cont}})$ be any fixed Stage 1 predictor and let $F_{\mathrm{RC}}=F_{\mathrm{anc}}+G$ be the Stage 2 refined predictor. Assume that on the target population, the Stage 2 correction is uniformly bounded:

$$|G(x_{\mathrm{cont}},x_{\mathrm{hist}})|\leq B\qquad\text{for all $(x_{\mathrm{cont}},x_{\mathrm{hist}})$ in the support of $\mathcal{Q}$.}\tag{42}$$

Then the excess target risk satisfies

$$\mathcal{R}_{\mathcal{Q}}(F_{\mathrm{RC}})\;\leq\;\mathcal{R}_{\mathcal{Q}}(F_{\mathrm{anc}})\;+\;2B\,\sqrt{\mathcal{R}_{\mathcal{Q}}(F_{\mathrm{anc}})}\;+\;B^{2},\tag{43}$$

where $\mathcal{R}_{\mathcal{Q}}(F):=\mathbb{E}_{\mathcal{Q}}\big[(F(x)-\mathbb{E}[y\mid x_{\mathrm{cont}}])^{2}\big]$ is the (noise-free) squared error on $\mathcal{Q}$.
#### Proof.
Let $m(x_{\mathrm{cont}})=\mathbb{E}[y\mid x_{\mathrm{cont}}]$. Then

$$\mathcal{R}_{\mathcal{Q}}(F_{\mathrm{RC}})=\mathbb{E}[(F_{\mathrm{anc}}-m+G)^{2}]=\mathbb{E}[(F_{\mathrm{anc}}-m)^{2}]+2\,\mathbb{E}[(F_{\mathrm{anc}}-m)G]+\mathbb{E}[G^{2}].$$

Since $|G|\leq B$ on the support of $\mathcal{Q}$, we have $\mathbb{E}[G^{2}]\leq B^{2}$, and by Cauchy–Schwarz,

$$|\mathbb{E}[(F_{\mathrm{anc}}-m)G]|\leq\sqrt{\mathcal{R}_{\mathcal{Q}}(F_{\mathrm{anc}})}\,\sqrt{\mathbb{E}[G^{2}]}\leq B\sqrt{\mathcal{R}_{\mathcal{Q}}(F_{\mathrm{anc}})},$$

yielding Eq. ([43](https://arxiv.org/html/2606.09891#A2.E43)). $\square$
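As a sanity check, the bound in Eq. (43) can be verified numerically on synthetic data. The sketch below is illustrative only: the conditional mean `m`, the anchor `f_anc`, and the correction `g` are hypothetical choices made for this example and are not taken from the paper. Note that the right-hand side of Eq. (43) equals $\big(\sqrt{\mathcal{R}_{\mathcal{Q}}(F_{\mathrm{anc}})}+B\big)^{2}$, and that the same argument applies to an empirical sample, so the inequality holds for the sample averages as well.

```python
import math
import random


def check_bound(n: int = 20000, B: float = 0.3, seed: int = 0):
    """Return (risk_anc, risk_rc, bound) on a synthetic sample.

    All modelling choices below are assumptions made for illustration.
    """
    rng = random.Random(seed)
    risk_anc = 0.0
    risk_rc = 0.0
    for _ in range(n):
        x_cont = rng.gauss(0.0, 1.0)
        x_hist = rng.gauss(0.0, 1.0)
        m = math.sin(x_cont)                # hypothetical E[y | x_cont]
        f_anc = 0.8 * x_cont                # hypothetical Stage 1 anchor
        g = B * math.tanh(x_hist - x_cont)  # Stage 2 correction with |g| <= B
        risk_anc += (f_anc - m) ** 2
        risk_rc += (f_anc + g - m) ** 2
    risk_anc /= n
    risk_rc /= n
    # Right-hand side of Eq. (43).
    bound = risk_anc + 2.0 * B * math.sqrt(risk_anc) + B ** 2
    return risk_anc, risk_rc, bound


if __name__ == "__main__":
    r_anc, r_rc, rhs = check_bound()
    print("anchor risk:", r_anc)
    print("RC risk:    ", r_rc)
    print("bound (43): ", rhs, "holds:", r_rc <= rhs)
```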
#### Implications and Practical Considerations
Condition ([42](https://arxiv.org/html/2606.09891#A2.E42)) can be enforced or monitored in practice. If each Stage 2 tree has leaf outputs bounded by $b$ (through leaf regularization and clipping), then $B\leq\eta_{2}T_{2}b$. More generally, $B$ can be estimated empirically on a target proxy (cold validation) by tracking the distribution of $|G|$. A direct corollary is that if the Stage 1 anchor beats a baseline full model on the target by a margin $\gamma$, then choosing Stage 2 such that the correction term $2B\sqrt{\mathcal{R}_{\mathcal{Q}}(F_{\mathrm{anc}})}+B^{2}$ on the right-hand side of Eq. ([43](https://arxiv.org/html/2606.09891#A2.E43)) is less than $\gamma$ ensures that RC strictly improves on that baseline on the target. This parallels Theorem [2](https://arxiv.org/html/2606.09891#S4.Thmtheorem2): RC helps when the anchor is strong on $\mathcal{Q}$ and Stage 2 is a controlled refinement.
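The checks described above can be sketched in a few lines. This is a hypothetical helper set, not the authors' tooling: it reads $\eta_{2}$ as the Stage 2 learning rate and $T_{2}$ as the number of Stage 2 trees (our interpretation of the notation), and it uses the fact that $2B\sqrt{R}+B^{2}<\gamma$ is equivalent to $B<\sqrt{R+\gamma}-\sqrt{R}$, where $R$ stands for $\mathcal{R}_{\mathcal{Q}}(F_{\mathrm{anc}})$.

```python
import math


def capacity_bound(eta2: float, num_trees: int, leaf_cap: float) -> float:
    """Worst-case |G| when every leaf output lies in [-leaf_cap, leaf_cap]."""
    return eta2 * num_trees * leaf_cap


def correction_term(B: float, risk_anc: float) -> float:
    """Risk increase permitted by Eq. (43): 2*B*sqrt(R) + B^2."""
    return 2.0 * B * math.sqrt(risk_anc) + B * B


def max_safe_B(risk_anc: float, gamma: float) -> float:
    """Threshold B* with correction_term(B*, R) == gamma; any B < B* stays below gamma."""
    return math.sqrt(risk_anc + gamma) - math.sqrt(risk_anc)


def empirical_B(corrections, q: float = 1.0) -> float:
    """q-quantile of |G| on a cold validation proxy (q=1.0 is the sample maximum).

    A quantile below 1.0 is only a diagnostic, not a uniform bound as in Eq. (42).
    """
    mags = sorted(abs(g) for g in corrections)
    if not mags:
        raise ValueError("need at least one correction value")
    idx = min(len(mags) - 1, max(0, math.ceil(q * len(mags)) - 1))
    return mags[idx]


if __name__ == "__main__":
    # Illustrative numbers only; none of these values come from the paper.
    risk_anc, gamma = 0.16, 0.09
    B_cap = capacity_bound(eta2=0.05, num_trees=10, leaf_cap=0.1)
    print("capacity bound on B:", B_cap)
    print("largest safe B:     ", max_safe_B(risk_anc, gamma))
    print("margin preserved:   ", correction_term(B_cap, risk_anc) < gamma)
```

The closed-form threshold makes it possible to budget Stage 2 capacity before training, while `empirical_B` supports monitoring after training.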