When Offline Selectors Cannot Beat the Best Single Model: A Diagnostic Study on edX Dropout Prediction

arXiv cs.LG 06/04/26, 04:00 AM Papers
Summary
This paper proposes a three-stage diagnostic framework to identify why offline model selectors fail to beat the best single model, applying it to dropout prediction on edX clickstream data. The study finds that the bottleneck is local representational ambiguity rather than learner choice or distribution shift, recommending state redesign or new data collection over further algorithm tuning.
arXiv:2606.04161v1 Announce Type: new
Cached at: 06/05/26, 02:21 AM
# When Offline Selectors Cannot Beat the Best Single Model: A Diagnostic Study on edX Dropout Prediction
Source: [https://arxiv.org/html/2606.04161](https://arxiv.org/html/2606.04161)
Alan Nadelsticher Ruvalcaba, Dustin Khang LeDuc, Thomas Trask, Nicholas Lytle, David Joyner

###### Abstract

Different predictors often excel on different inputs, so picking the best one per instance promises higher accuracy than committing to a single model\. In practice, selectors trained from logged data routinely fail to beat the strongest single predictor\. Three causes typically go unseparated before more tuning is applied: a mismatched learner, a state that does not predict which model wins, or buffer\-to\-deployment label shift\.

A three\-stage diagnostic rules them out on a shared buffer\. Stage 1 estimates a local ceiling on oracle recovery from $k$\-NN label consistency\. Stage 2 asks whether paired BC and offline\-RL learners \(BC, DQN, and CQL across penalty weights\) reach that ceiling\. Stage 3 ablates the selector state to test whether richer features would raise it\. The combined verdict points to the most promising next step: tuning the learner, redesigning the state, or collecting new data\.

We apply it to selecting among five dropout\-prediction models on edX clickstream data\. Across 16 windows, the oracle beats the strongest single base model by 9\.7 accuracy points on average, yet BC, DQN, and CQL land in the same test\-accuracy band below it \(robust to a tenfold buffer sweep and $N{=}2{,}000$ held\-out examples\)\. The bottleneck is local representational ambiguity: CQL closes the imitation gap without a deployment gain \(not conservatism\), regret clusters tightly across learners \(not tie\-breaking\), and the three learners converge on test accuracy \(not shift\)\. The next iteration should change the state or collect new data, not tune the offline learner further\.

offline reinforcement learning, meta\-learning, model selection, distribution shift, diagnostic analysis, educational data mining

## 1 Introduction

Decision\-making from offline datasets is central to a growing class of applications where online interaction is expensive, slow, or ethically constrained, including scientific discovery, engineering design, healthcare, and education \(Levine et al\., [2020](https://arxiv.org/html/2606.04161#bib.bib1)\)\. The data take many forms \(logged demonstrations, past trajectories, recorded interactions\)\. A recurring special case is meta\-learning over a pool of base models, where a policy chooses the best base model for each instance instead of committing to a single predictor \(Rice, [1976](https://arxiv.org/html/2606.04161#bib.bib2); Cruz et al\., [2018](https://arxiv.org/html/2606.04161#bib.bib3), [2015](https://arxiv.org/html/2606.04161#bib.bib5)\)\. The setup is attractive because individual base models can exhibit context\-dependent competence, and prior educational prediction studies commonly compare several plausible model classes on clickstream and virtual\-learning\-environment data \(Liu et al\., [2023](https://arxiv.org/html/2606.04161#bib.bib6); Taylor et al\., [2014](https://arxiv.org/html/2606.04161#bib.bib7); Casado Hidalgo et al\., [2022](https://arxiv.org/html/2606.04161#bib.bib15)\)\. In practice, offline meta\-learning routinely underperforms strong static baselines, and the reasons are opaque\. The weakness could be *algorithmic* \(offline RL’s conservatism \(Kumar et al\., [2020](https://arxiv.org/html/2606.04161#bib.bib8)\) or reward misspecification\), *representational* \(the state does not contain the information needed to predict which model wins\), or *distributional* \(the oracle\-label distribution available offline differs from the one induced at evaluation\)\. Existing evaluations rarely separate these causes\. As a result, practitioners tune algorithms that cannot fix representation failures, engineer features that add zero marginal signal, or attribute failure to shift without quantifying it\.

We rule out the three hypotheses in turn with three diagnostics on a shared buffer \([Figure 1](https://arxiv.org/html/2606.04161#S1.F1)\)\. Stage 1 measures local label consistency\. The $k$\-nearest\-neighbor consistency quantifies how often nearby states in the buffer share the same oracle action, and a held\-out $10$\-NN selector calibrates how that local ambiguity translates into test\-time imitation\. Stage 2 separates algorithmic failure from representational failure\. We train a supervised behavioral\-cloning policy with hard\-label cross\-entropy and an offline Deep Q\-Network \(Mnih et al\., [2013](https://arxiv.org/html/2606.04161#bib.bib9)\) on the same buffer\. If both fail by similar margins, the bottleneck is more plausibly shared, pointing to representation or distribution rather than algorithm choice\. Stage 3 isolates the marginal value of features\. State ablations test whether the full behavioral state improves over the base\-model probability vector, and whether disagreement\-derived transforms of that vector add anything further\.

The three stages constrain each other\. Stage 1’s local\-consistency ceiling is what makes Stage 2’s learner\-agreement gap interpretable: learners converging close to a low ceiling implicates the representation, while learners separating below a high ceiling implicates the algorithm\. Stage 3 then tests whether reachable features could raise the ceiling\. Run together on a shared buffer, the three checks identify which intervention \(more tuning, richer features, or upstream data collection\) matters next\. Each individual diagnostic \($k$\-NN consistency, paired BC/RL ablation, feature ablation, and total\-variation buffer\-to\-test shift\) has antecedents in prior work \(Cruz et al\., [2018](https://arxiv.org/html/2606.04161#bib.bib3); Ko et al\., [2008](https://arxiv.org/html/2606.04161#bib.bib4); Kumar et al\., [2020](https://arxiv.org/html/2606.04161#bib.bib8)\); the contribution is the joint reading they enable\.

![Refer to caption](https://arxiv.org/html/2606.04161v1/figures/3stage-fig4.png)

Figure 1: Three\-stage diagnostic protocol for offline model selection\. The same offline buffer is first used to measure local oracle consistency, then to compare algorithm\-specific and shared failure modes across BC, offline DQN, and CQL at three penalty weights, and finally to ablate the selector state to test whether additional feature groups provide marginal value beyond base\-model probabilities\.

We apply the protocol to a concrete offline decision\-making task: selecting among five static dropout\-prediction models for MOOC and in\-person computer science students on edX clickstream data\. The task combines abundant observational data \(84\.5M events across $223{,}505$ student\-course pairs\) with ethically constrained online experimentation, because any selected model eventually triggers a human intervention\. Across 16 observation/prediction\-window configurations, the per\-instance oracle beats the strongest single base model by 9\.7 accuracy points on average \(range 4\.5–15\.5\), yet no learned selector recovers that headroom on held\-out accuracy\. On the main $(14\text{d}, 14\text{d})$ configuration, every learner sits within $\pm 0.01$ of $0.748$ test accuracy, below the $0.762$ static reference, while the local\-consistency diagnostic is only $0.388 \pm 0.010$ \([Table 1](https://arxiv.org/html/2606.04161#S6.T1)\)\. State ablations show that probabilities\-only BC nearly matches the full state, and that disagreement\-derived transforms do not materially improve it\. Buffer\-to\-test shift is small in aggregate \(marginal $d_{\mathrm{TV}} = 0.063 \pm 0.011$\) but locally substantial \($\mathbb{E}_{s}[d_{\mathrm{TV}}] \approx 0.29$ on the main configuration\)\.
The procedure converts an opaque negative result into a verdict on whether the next iteration should target the learner, the features, or the data\-collection pipeline\. Our contributions are:

- C1\. A combined diagnostic procedure assembled from established checks \($k$\-NN consistency, paired BC/RL ablation, state ablation, and marginal/conditional $d_{\mathrm{TV}}$\) for deciding whether offline data is sufficient before further offline tuning or online adaptation\. We apply it as a single\-task case study; whether it generalizes to other settings is left to future work\.
- C2\. An empirical case study on selecting among five pre\-trained dropout\-prediction models on edX clickstream data\. Across BC, offline DQN, and CQL at three penalty weights, no learned selector beats the strongest single base model on held\-out accuracy despite a 9\.7\-point per\-instance oracle gap\. Mechanism checks weigh against three candidate causes: algorithmic conservatism \(CQL closes the imitation gap without a deployment gain\), hard\-label tie\-breaking \(regret clusters in $[0.089, 0.101]$ across BC/DQN/CQL while oracle agreement spreads over $0.36$ to $0.52$\), and buffer\-to\-test marginal shift\. The remaining candidate, consistent with the diagnostic readout, is local label ambiguity\. Richer offline encodings of the same probability vector \(disagreement\-derived transforms and the full 38\-d state\) do not measurably improve deployment accuracy over the 5\-d probability subspace alone, and the result is insensitive to training\-buffer size\.

Shared\-failure patterns plausibly arise in other offline meta\-learning settings that combine pre\-trained predictors, including drug\-response prediction, content recommendation, and offline hyperparameter selection\. Whether the diagnostic procedure transfers to those settings is an open question we hope to test\.

## 2 Related Work

This paper sits at the intersection of offline decision\-making, dynamic model selection, and educational outcome prediction\. Offline RL \(Levine et al\., [2020](https://arxiv.org/html/2606.04161#bib.bib1)\) must cope with support mismatch between the data distribution and the policy desired at deployment, while contextual\-bandit and offline\-policy\-evaluation work in education provides closely related decision framings \(Lan and Baraniuk, [2016](https://arxiv.org/html/2606.04161#bib.bib19); Mandel et al\., [2014](https://arxiv.org/html/2606.04161#bib.bib20)\)\. Conservative or behavior\-constrained methods such as CQL, IQL, BCQ, BRAC, and BEAR \(Kumar et al\., [2020](https://arxiv.org/html/2606.04161#bib.bib8); Kostrikov et al\., [2021](https://arxiv.org/html/2606.04161#bib.bib10); Fujimoto et al\., [2019](https://arxiv.org/html/2606.04161#bib.bib11); Wu et al\., [2019](https://arxiv.org/html/2606.04161#bib.bib13); Kumar et al\., [2019](https://arxiv.org/html/2606.04161#bib.bib12)\) aim to reduce offline\-RL failures caused by distribution shift, extrapolation error, or out\-of\-distribution action evaluation, but they do not by themselves resolve ambiguity in which action is optimal from the available state\. Our contribution is therefore diagnostic\. We quantify when the offline state and label construction are too ambiguous for the selectors we evaluate \(a paired BC/offline\-DQN/CQL family on a shared buffer\) to recover oracle headroom, and we identify this as a failure mode that conservatism alone does not resolve\.

The per\-instance selection problem traces back to algorithm selection \(Rice, [1976](https://arxiv.org/html/2606.04161#bib.bib2)\) and dynamic classifier selection \(Cruz et al\., [2018](https://arxiv.org/html/2606.04161#bib.bib3); Ko et al\., [2008](https://arxiv.org/html/2606.04161#bib.bib4); Cruz et al\., [2015](https://arxiv.org/html/2606.04161#bib.bib5)\), where different models dominate in different regions of feature space\. In education, related work spans clickstream\-based MOOC dropout prediction \(Dalipi et al\., [2018](https://arxiv.org/html/2606.04161#bib.bib16); Dass et al\., [2021](https://arxiv.org/html/2606.04161#bib.bib23); Taylor et al\., [2014](https://arxiv.org/html/2606.04161#bib.bib7); Xing et al\., [2016](https://arxiv.org/html/2606.04161#bib.bib25)\), e\-learning dropout prediction with model combinations \(Lykourentzou et al\., [2009](https://arxiv.org/html/2606.04161#bib.bib27)\), higher\-education dropout prediction with administrative and LMS data \(Goren et al\., [2024](https://arxiv.org/html/2606.04161#bib.bib17)\), and meta\-learning for student\-performance prediction \(Casado Hidalgo et al\., [2022](https://arxiv.org/html/2606.04161#bib.bib15)\)\. We build on that observation but focus on diagnosing why offline adaptive selection fails even when oracle headroom exists\. Each diagnostic in our procedure has antecedents elsewhere\. The $k$\-NN label consistency is closely related to neighborhood\-purity measures used in dynamic classifier selection \(Cruz et al\., [2018](https://arxiv.org/html/2606.04161#bib.bib3); Ko et al\., [2008](https://arxiv.org/html/2606.04161#bib.bib4)\), paired BC\-versus\-offline\-RL ablations are a routine offline\-RL diagnostic pattern, and feature ablations are standard ML practice\. The contribution is the combined empirical application on this task, including the conservatism check against CQL \(Kumar et al\., [2020](https://arxiv.org/html/2606.04161#bib.bib8)\) and the conditional buffer\-to\-test shift estimate\.
The setup also resembles algorithm selection from ex post oracle\-labeled buffers more than a full logged\-bandit benchmark with historical action propensities \(Dudík et al\., [2011](https://arxiv.org/html/2606.04161#bib.bib21); Swaminathan and Joachims, [2015](https://arxiv.org/html/2606.04161#bib.bib22)\)\. We therefore treat offline RL here as one comparator family among several, not as the sole framing\.

The paper also contributes to evaluation protocol\. Offline\-learning benchmarks often report final policy quality without first quantifying whether the dataset contains actionable oracle headroom, whether simple local predictors can recover the relevant action labels, or whether distribution shift is large enough to dominate the result\. This protocol turns those hidden assumptions into explicit measurements\.

## 3 Problem Formulation

We formalize dynamic model selection as a contextual bandit\. The oracle\-label distribution is induced explicitly by the buffer construction\. This formulation is sufficient for the one\-step decision problem studied here and lets us separate algorithmic failure from distributional failure later in the protocol\.

##### Decision task\.

We are given a pool of pre\-trained base classifiers $\mathcal{M} = \{m_1, \dots, m_K\}$\. Here $K{=}5$ \(logistic regression, random forest, gradient boosting, calibrated random forest, and a stacking ensemble\)\. For each student\-course pair, a 14\-day observation window is summarized as a state $s \in \mathcal{S} \subseteq \mathbb{R}^{d}$, and the task is to select a single model $a \in \mathcal{A} = \{1, \dots, K\}$ whose prediction for that sample is used at deployment\.

##### State and reward\.

The state vector concatenates 28 engineered behavioral features \(volume, consistency, temporal patterns, trends; see [Section 5](https://arxiv.org/html/2606.04161#S5)\) with a binary modality flag, the $K$\-dimensional vector of base\-model dropout probabilities on that sample, their mean, and a 3\-way one\-hot bin derived from that mean, yielding the 38\-dimensional state used in the main experiments\. The deployment metric is the selected model’s zero\-one correctness on the 14\-day prediction label, $R_{\mathrm{eval}}(s,a) = \mathbb{I}[m_a(s)\text{ predicts the true label}]$\. The offline DQN in the body is trained with the canonical oracle\-match reward $R(s,a) = R_{\mathrm{eval}}(s,a)$ on the buffer\. The log\-probability shaping variant used in earlier drafts is preserved as a sensitivity row in Appendix [C](https://arxiv.org/html/2606.04161#A3)\. Throughout, *oracle agreement* is a label\-imitation diagnostic against a single argmax action, *test accuracy* is the deployment\-facing metric, and *regret* \([Section 6\.4](https://arxiv.org/html/2606.04161#S6.SS4)\) is the fraction of test samples on which the policy selects an incorrect model when at least one correct model exists\.

##### Contextual\-bandit reduction\.

Because each sample’s observation window is fixed and does not evolve under the agent’s choice, within\-sample state transitions are degenerate\. We set the discount factor $\gamma{=}0$, reducing the MDP to a contextual bandit \(Lan and Baraniuk, [2016](https://arxiv.org/html/2606.04161#bib.bib19)\)\. The Deep $Q$\-Network used in [Section 4](https://arxiv.org/html/2606.04161#S4) retains its full architecture but learns only the immediate action\-value $Q(s,a)$, with no bootstrapping term\.
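
To make the reduction concrete, the following worked equation sketches how the one\-step target loses its bootstrap term when $\gamma{=}0$\. The target\-network symbol $Q_{\bar\theta}$, the next\-state symbol $s'_i$, and the squared\-error form of the loss are notational assumptions for illustration; the source does not state which regression loss is used\.

```latex
y_i \;=\; r_i + \gamma \max_{a'} Q_{\bar\theta}(s'_i, a')
\;\;\xrightarrow{\;\gamma = 0\;}\;\;
y_i \;=\; r_i,
\qquad
\mathcal{L}(\theta) \;=\; \frac{1}{N} \sum_{i=1}^{N} \bigl( Q_{\theta}(s_i, a^{\star}_i) - r_i \bigr)^{2}.
```

Under this reading, the network regresses directly onto the immediate reward stored in the buffer, and the target network never enters the target value\.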

##### Offline buffer and buffer oracle distribution\.

The buffer $\mathcal{B} = \{(s_i, a^{\star}_i, r_i)\}_{i=1}^{N}$ is built by 4\-fold cross\-validation on the training split\. For each fold, base models are fit on the other three folds and scored on the held\-out fold\. The *oracle action* is the hard label

$$a^{\star}_{i} \;=\; \arg\max_{a \in \mathcal{A}} \left[ y_i\, p_a(s_i) + (1 - y_i)\,(1 - p_a(s_i)) \right], \tag{1}$$

that is, the base model assigning the highest probability to the true class\. The induced distribution $\pi_{\beta}(a \mid s)$ is the categorical distribution over these oracle labels\. It is reconstructed from cross\-validation, so it has no observed action propensities and should not be read as a historical logging policy\. Every offline method in this paper \(held\-out $10$\-NN selection, behavioral cloning, offline DQN, and the state\-ablation variants\) sees $\mathcal{B}$ and nothing else\.
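
The oracle\-label rule in Equation \(1\) can be sketched in a few lines\. This is a minimal illustration, not the authors' code: the function name `oracle_action`, the 0\-based action indices, the example probabilities, and the first\-maximiser tie\-break are all assumptions\.

```python
def oracle_action(y, probs):
    """Index of the base model assigning the highest probability to the true class.

    y     -- true label (1 = dropout, 0 = no dropout)
    probs -- dropout probabilities p_a(s), one per base model
    """
    # Probability each model assigns to the true class, as in Equation (1).
    scores = [y * p + (1 - y) * (1 - p) for p in probs]
    # Assumption: ties are broken by taking the first maximiser.
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical out-of-fold probabilities for K = 5 base models (0-based indices).
probs = [0.62, 0.48, 0.71, 0.55, 0.66]
print(oracle_action(1, probs))  # 2: the most confident "dropout" model
print(oracle_action(0, probs))  # 1: the least confident "dropout" model
```

The same sample can therefore receive a different oracle action depending only on its true label, which is one way nearby states end up with conflicting labels\.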

##### Evaluation oracle and buffer\-to\-test shift\.

At test time we compute the *evaluation oracle* $\pi^{\star}(a \mid s)$ by the same argmax\-over\-base\-models construction on the held\-out 200\-sample test set\. The quantity the offline agent must close is the total\-variation distance between the marginal action distributions:

$$d_{\mathrm{TV}}\!\left(\pi_{\beta}, \pi^{\star}\right) \;=\; \tfrac{1}{2} \sum_{a \in \mathcal{A}} \bigl| \mathbb{E}_{s \sim \mathcal{B}}\, \pi_{\beta}(a \mid s) - \mathbb{E}_{s \sim \mathcal{D}_{\mathrm{test}}}\, \pi^{\star}(a \mid s) \bigr|. \tag{2}$$

[Section 6](https://arxiv.org/html/2606.04161#S6) reports this quantity per window configuration and shows it is substantial\. The CV\-induced buffer concentrates probability mass on actions that are not dominant at test time, so an offline agent whose action distribution tracks $\pi_{\beta}$ will, by construction, place mass on actions that $\pi^{\star}$ does not, with the gap lower\-bounded by $d_{\mathrm{TV}}(\pi_{\beta}, \pi^{\star})$\. We treat $d_{\mathrm{TV}}(\pi_{\beta}, \pi^{\star})$ as a buffer\-to\-test oracle\-shift diagnostic, not as a full logged\-policy support estimate\.
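
With hard oracle labels, the two expectations in Equation \(2\) reduce to empirical label frequencies, so the marginal distance can be sketched as below\. The function name `marginal_tv` and the toy label lists are illustrative assumptions, not values from the paper\.

```python
from collections import Counter


def marginal_tv(buffer_actions, test_actions, n_actions):
    """Total-variation distance between two empirical oracle-label marginals."""
    n_buf, n_test = len(buffer_actions), len(test_actions)
    buf_counts, test_counts = Counter(buffer_actions), Counter(test_actions)
    # Half the L1 distance between the two frequency vectors, as in Equation (2).
    return 0.5 * sum(
        abs(buf_counts[a] / n_buf - test_counts[a] / n_test)
        for a in range(n_actions)
    )


# Toy oracle labels over three actions (hypothetical).
print(marginal_tv([0, 0, 1, 2], [0, 1, 1, 2], n_actions=3))  # 0.25
```

A value of 0 means the buffer and test oracles pick each model equally often overall; it says nothing about whether they agree state by state, which is why the conditional estimate is reported separately\.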

##### Off\-policy evaluation framing\.

The setup doubles as an off\-policy evaluation problem, with $\pi_{\beta}$ the behavior policy and $\pi^{\star}$ the target\. Our test\-accuracy and regret reports \([Section 6\.4](https://arxiv.org/html/2606.04161#S6.SS4)\) provide a direct deployment\-value estimate for the selectors we evaluate, in lieu of importance\-weighted OPE\. Importance weighting \(Dudík et al\., [2011](https://arxiv.org/html/2606.04161#bib.bib21); Swaminathan and Joachims, [2015](https://arxiv.org/html/2606.04161#bib.bib22)\) would require action propensities, which the CV reconstruction does not provide\. [Section 7](https://arxiv.org/html/2606.04161#S7) returns to why conservative offline\-RL methods do not close the failure mode our diagnosis uncovers\.

## 4 Three\-Stage Diagnostic Protocol

The protocol answers three questions in sequence\. \(1\) Are the oracle’s selections locally consistent in the current state representation? \(2\) If so, do a supervised and a reinforcement\-learning approach succeed or fail together \(pointing to representation or distribution\) or differently \(pointing to algorithm\)? \(3\) Which state components carry signal beyond what the base\-model probabilities already encode? Each stage produces a compact diagnostic\. A reader who runs Stage 1 alone can already tell whether the state is locally ambiguous before investing in heavier learners\.

### 4\.1 Stage 1: Local label consistency via $k$\-nearest neighbors

If neighboring states in the buffer disagree on the oracle action, no local selector can rely on a stable neighborhood rule\. We measure this ambiguity directly\. For each buffer sample $(s_i, a^{\star}_i) \in \mathcal{B}$, we locate its $k$ nearest neighbors in state space under the Euclidean metric on the standardized $d$\-dimensional state and compute the *consistency*

$$c_i \;=\; \tfrac{1}{k} \sum_{j \in \mathcal{N}_k(s_i)} \mathbb{I}\!\left[ a^{\star}_j = a^{\star}_i \right], \tag{3}$$

then average $\bar{c} = \mathbb{E}_i[c_i]$ over the buffer\. We use $k{=}10$ throughout\. Smaller $k$ introduces sampling noise; larger $k$ washes out local structure\. The quantity $\bar{c}$ is a diagnostic, not a formal upper bound on downstream BC or DQN performance\. Low values indicate that nearby states often map to different oracle actions, so any local state\-conditioned selector must resolve substantial ambiguity\. To calibrate this diagnostic against an actual predictor, we also evaluate a held\-out $10$\-NN selector that predicts the oracle action of each test sample from its nearest training\-buffer neighbors in the same standardized state space\. We report its oracle agreement and test accuracy alongside $\bar{c}$\.
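
The consistency score in Equation \(3\) can be sketched with a brute\-force neighbor search\. This is an illustrative version under stated assumptions: states are taken to be standardized already, each sample is excluded from its own neighborhood \(the source does not say whether it is\), and the function name and toy data are hypothetical\.

```python
import math


def knn_consistency(states, actions, k=10):
    """Mean fraction of each sample's k nearest neighbours sharing its oracle action."""
    n = len(states)
    total = 0.0
    for i in range(n):
        # Euclidean distance to every other sample; the sample itself is
        # excluded from its own neighbourhood (an assumption).
        ranked = sorted(
            (math.dist(states[i], states[j]), j) for j in range(n) if j != i
        )
        neighbours = [j for _, j in ranked[:k]]
        matches = sum(actions[j] == actions[i] for j in neighbours)
        total += matches / len(neighbours)
    return total / n


# Two well-separated toy clusters whose oracle labels follow the clusters.
states = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(knn_consistency(states, [0, 0, 0, 1, 1, 1], k=2))  # 1.0
```

On the same toy states, labels that cut across the clusters pull the score well below 1, which is the pattern the paper's reported value of $0.388$ indicates for the real buffer\. The quadratic search is fine for an 800\-sample buffer; a tree or approximate index would be the usual substitute at larger scale\.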

### 4\.2 Stage 2: Algorithm versus representation via paired ablation

A low local\-consistency diagnostic does not by itself distinguish algorithmic failure from representational failure\. To separate them, we train two policies on the same buffer using architectures that would fail for different reasons if the bottleneck were algorithmic\. The first is a supervised behavioral cloner that does not require reward engineering\. The second is an offline $Q$\-learner that does\. If both fail by similar margins, the bottleneck is more plausibly shared, pointing to representation or distribution\.

The behavioral\-cloning policy $\pi^{\mathrm{BC}}(a \mid s)$ is a two\-layer MLP \($d \to 64 \to K$\) trained with cross\-entropy loss against the hard oracle labels from the 4\-fold cross\-validation buffer, for 30 epochs with dropout $0.2$ and weight decay $10^{-4}$\. This is a vanilla supervised multi\-class classifier, so any failure here cannot be attributed to reward sparsity, bootstrapping instability, or target\-network dynamics\.

The Deep $Q$\-Network \(Mnih et al\., [2013](https://arxiv.org/html/2606.04161#bib.bib9)\) is a larger MLP \($d \to 128 \to 64 \to K$\) trained with the one\-step bandit objective \($\gamma{=}0$\), Adam, soft target updates with rate $\tau{=}0.05$, and a replay buffer of the full offline $\mathcal{B}$\. We use the canonical $\{0,1\}$ oracle\-match reward $R(s,a) = \mathbb{I}[m_a(s)\text{ predicts the true label}]$, which matches the deployment metric and avoids the policy collapse that arises under log\-probability shaping \(see Appendix [C](https://arxiv.org/html/2606.04161#A3) for the reward\-sensitivity analysis\)\. The network is trained for 50 epochs on the same buffer used by BC\.

To control for offline\-RL conservatism, we also train a Conservative $Q$\-Learning \(CQL\) \(Kumar et al\., [2020](https://arxiv.org/html/2606.04161#bib.bib8)\) variant on the same buffer, network, and reward\. CQL adds the standard penalty $\alpha\, \mathbb{E}_{s}\!\left[ \log \sum_{a'} \exp Q(s,a') - Q(s,a^{\star}) \right]$ to the Bellman loss, suppressing $Q$\-values for actions outside the buffer’s support\. We sweep $\alpha \in \{0.1, 1.0, 5.0\}$\. We anchor the second term on the hard buffer\-oracle action $a^{\star}$ instead of $\mathbb{E}_{a \sim \hat{\pi}_{\beta}(a \mid s)}[Q(s,a)]$; under the deterministic CV\-induced label distribution this collapses to the standard form and avoids estimating a stochastic behavior policy\. Because $\gamma{=}0$, the conservatism term is the only thing distinguishing CQL from DQN here, which isolates the contribution of pessimism from any bootstrapping effect\.
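
The penalty term described above can be sketched on plain lists of $Q$\-values\. This is only the conservatism term, not a training loop; the function names and the toy $Q$\-values are assumptions, and a real implementation would compute the same quantity on framework tensors so that gradients flow through it\.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_batch, oracle_actions, alpha):
    """alpha * mean over the batch of [logsumexp_a' Q(s, a') - Q(s, a_star)]."""
    terms = [logsumexp(q) - q[a] for q, a in zip(q_batch, oracle_actions)]
    return alpha * sum(terms) / len(terms)


# Uninformative Q-values over K = 5 actions give a penalty of log(5).
print(round(cql_penalty([[0.0] * 5], [0], alpha=1.0), 3))  # 1.609
# A Q-function that strongly prefers the buffer action is barely penalised.
print(cql_penalty([[10.0, 0.0, 0.0, 0.0, 0.0]], [0], alpha=1.0) < 1e-3)  # True
```

The term is non\-negative and shrinks toward zero as $Q(s, a^{\star})$ dominates the other actions, which is why it pushes the learned policy toward the buffer's oracle labels\.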

For each policy we report *oracle agreement* \(the fraction of test samples on which the policy’s selected action matches the hard evaluation\-oracle label\) and *test accuracy* \(the accuracy of the selected base model’s prediction on those samples\)\. The first measures how well the policy imitates the oracle; the second measures whether imitation translates into useful predictions\. A low oracle agreement combined with near\-reference accuracy is consistent with a policy defaulting to a strong single model regardless of $s$\.
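
The reported metrics can be sketched from three per\-sample inputs\. The function name, the argument layout, and the toy data are illustrative assumptions; regret follows the definition given in the State and reward paragraph, with all test samples in the denominator\.

```python
def selector_metrics(selected, oracle, correct):
    """Return (oracle agreement, test accuracy, regret) for a selector.

    selected[i]   -- action chosen by the policy on test sample i
    oracle[i]     -- hard evaluation-oracle action for sample i
    correct[i][a] -- True if base model a predicts sample i correctly
    """
    n = len(selected)
    agreement = sum(s == o for s, o in zip(selected, oracle)) / n
    accuracy = sum(correct[i][selected[i]] for i in range(n)) / n
    # Regret: a wrong model was selected although some correct model existed.
    regret = sum(
        any(correct[i]) and not correct[i][selected[i]] for i in range(n)
    ) / n
    return agreement, accuracy, regret


# Toy case with three base models and four test samples (hypothetical).
correct = [
    [True, False, False],
    [False, True, True],
    [False, False, False],  # no model is right, so no regret is possible
    [False, True, False],
]
print(selector_metrics([0, 1, 2, 0], [0, 2, 2, 1], correct))  # (0.5, 0.5, 0.25)
```

In the second sample the policy disagrees with the oracle yet still picks a correct model, which illustrates why oracle agreement and test accuracy can diverge\.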

### 4\.3 Stage 3: Feature marginal value via state ablations

If the 28 engineered behavioral features carry unique signal for model selection \(information not already encoded in the 5 base\-model probabilities\) *that the meta\-learner can exploit*, we would expect removing them to reduce held\-out accuracy\. We construct a second pair of policies \(BC and a probabilities\-only MLP baseline\) whose input is the 5\-dimensional probability vector alone, training them on the same buffer with matched optimizer settings\. The difference in test accuracy between full\-state \($d{=}38$\) and probabilities\-only \($d{=}5$\) variants isolates the marginal contribution of the engineered features to the meta\-learning task, separately from their well\-established contribution to the *base\-model* prediction task\. \(The base models still use the full 28 behavioral features during their own training\.\) A null difference is consistent with the 28 features being redundant for model selection under our learners and sample size, despite their established value for the base\-model prediction task; it does not, on its own, rule out an information\-bearing signal that a different architecture or larger buffer could exploit\.

We then augment the 38\-dimensional state with 13 disagreement\-derived transforms of the same base\-model probability vector\. These include the probability standard deviation, all pairwise absolute differences, the predictive entropy of the mean probability, and the top\-1/top\-2 margin\. They serve as a representation\-bias test\. They are also deterministic transforms of quantities already in the state and so do not add independent information; a failure to improve held\-out accuracy is evidence against this particular tweak, but not against any future augmentation\.
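
The count of 13 is consistent with one standard deviation, ten pairwise differences among five probabilities, one entropy, and one margin\. The sketch below builds that vector under explicit assumptions: population standard deviation, natural\-log binary entropy of the mean probability, and a margin defined as the largest minus the second\-largest base\-model probability\. The source does not pin down any of these three choices, and the function name is hypothetical\.

```python
import math
from itertools import combinations
from statistics import pstdev


def disagreement_features(probs):
    """13 deterministic transforms of a 5-d base-model probability vector."""
    mean_p = sum(probs) / len(probs)
    eps = 1e-12  # guards log(0) when the mean is exactly 0 or 1
    # Binary entropy of the mean dropout probability (natural log, assumed).
    entropy = -(
        mean_p * math.log(mean_p + eps)
        + (1 - mean_p) * math.log(1 - mean_p + eps)
    )
    # Assumed margin: largest minus second-largest base-model probability.
    top1, top2 = sorted(probs, reverse=True)[:2]
    pairwise = [abs(p - q) for p, q in combinations(probs, 2)]
    return [pstdev(probs)] + pairwise + [entropy, top1 - top2]


features = disagreement_features([0.62, 0.48, 0.71, 0.55, 0.66])
print(len(features))  # 13
```

Every entry is a function of the five probabilities alone, which makes concrete why the augmentation tests representation bias and cannot add independent information\.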

## 5 Experimental Setup

This section specifies the dataset, sampling, features, base models, and buffer construction underlying every experiment in [Section 6](https://arxiv.org/html/2606.04161#S6)\. Each choice is reproducible from the released configuration\. The only source of variance across runs is the random seed controlling sample draws, model initialization, and the stochastic cross\-validation partition\.

### 5\.1 Dataset and inclusion criteria

We use edX clickstream data from 52 offerings of the same introductory computer science course, covering $223{,}505$ student\-course pairs and 84\.5 million events\. We retain learners with at least 10 events, at least 3 active days, and at least 28 days of activity span, yielding $34{,}303$ qualifying pairs from $21{,}990$ students\. For each experimental configuration we draw a stratified sample of $1{,}000$ student\-course pairs, so that all 16 window configurations and 5\-seed sweeps are evaluated under the same per\-configuration budget\. The resulting diagnostics are estimates under this sampled regime, not full\-corpus limits\. The corpus mixes 38 MOOC offerings and 9 in\-person sections, so the state includes a binary modality flag\. Additional cohort and feature details are in Appendix [A](https://arxiv.org/html/2606.04161#A1) and [Table 3](https://arxiv.org/html/2606.04161#A1.T3)\.

### 5\.2 Windows and labels

For each sampled pair we construct an observation window followed by a prediction window\. The main configuration is $(14\text{d}, 14\text{d})$\. Dropout is defined as no events in the prediction window, giving a 33\.7% positive rate\. We also evaluate the full $4 \times 4$ grid with $T_{\mathrm{obs}}, T_{\mathrm{pred}} \in \{7, 14, 21, 28\}$ days\. We center on $(14\text{d}, 14\text{d})$ because shorter horizons are noisier and longer horizons are less actionable for intervention\.
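
The labeling rule can be sketched as follows\. Events are assumed to be integer day offsets from the start of the observation window, and the prediction window is taken as the half\-open interval that immediately follows it; the source does not state how windows are anchored, so both conventions and the function name are assumptions\.

```python
def dropout_label(event_days, t_obs=14, t_pred=14):
    """Return 1 if no event falls in the prediction window, else 0.

    event_days -- day offsets of a learner's events, measured from the start
                  of the observation window (an assumed convention)
    """
    window_start, window_end = t_obs, t_obs + t_pred
    active = any(window_start <= day < window_end for day in event_days)
    return int(not active)


print(dropout_label([0, 2, 5, 13]))  # 1: nothing on days 14 to 27
print(dropout_label([0, 2, 14]))     # 0: one event inside the prediction window
print(dropout_label([0, 30]))        # 1: activity resumes only after the window
```

Sweeping `t_obs` and `t_pred` over 7, 14, 21, and 28 reproduces the shape of the 16\-configuration grid\.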

### 5\.3 Engineered features

From each observation window we extract 28 aggregate behavioral features covering activity volume, content ratios, consistency, temporal placement, trends, and concentration of activity\. All features are standardized for the linear components of the pipeline\. The feature taxonomy and examples are in [Table 3](https://arxiv.org/html/2606.04161#A1.T3)\.

### 5\.4 Base models

We train five base classifiers on the full training set using all 28 engineered features plus the binary modality flag\. LR is L2\-regularized logistic regression\. RF is a random forest \(400 trees, depth 16, min\-split 2\)\. GB is gradient boosting \(300 trees, lr 0\.05, depth 3\)\. Calibrated RF is an isotonic\-CV calibrated random forest\. Stacking \(Wolpert, [1992](https://arxiv.org/html/2606.04161#bib.bib14)\) is a logistic\-regression meta\-learner over LR, RF, and GB\. Each model outputs a scalar dropout probability $p_{i}(s)\in[0,1]$\. The meta\-learning action set $\mathcal{A}=\{1,\ldots,5\}$ corresponds to these five models\.

### 5\.5Buffer construction and evaluation protocol

We split each $1{,}000$\-pair sample into stratified train \($800$\) and test \($200$\) partitions\. The offline buffer is built from 4\-fold cross\-validation on the training split, yielding out\-of\-fold base\-model probabilities and hard oracle actions without leakage\. The main state has 38 dimensions: 28 behavioral features, 1 modality flag, 5 base\-model probabilities, 1 mean probability, and a 3\-way one\-hot risk bin\. The disagreement augmentation adds 13 deterministic transforms of the probability vector, yielding a 51\-dimensional augmented state\. When we report the *best static* or *strongest single\-model* reference, it is the single base model with the highest accuracy on that held\-out test split\. We report it as a hindsight reference, not as a deployable model\-selection rule\.
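
The 38\-dimensional state and the hard oracle action can be sketched in NumPy as below\. This is an illustration under stated assumptions rather than the released code: the risk\-bin edges at $1/3$ and $2/3$ of the mean probability, the rule that the oracle picks the model assigning the highest probability to the true label, and the names `build_state` and `oracle_actions` are all hypothetical\. The inputs are assumed to be out\-of\-fold probabilities already\.

```python
import numpy as np

def oracle_actions(probs, y):
    """Hard oracle action per sample.

    probs: (n, K) dropout probabilities; y: (n,) binary labels.
    Assumed rule: choose the model giving the true label the highest probability.
    """
    p_true = np.where(y[:, None] == 1, probs, 1.0 - probs)
    return p_true.argmax(axis=1)

def build_state(features, modality, probs, bin_edges=(1 / 3, 2 / 3)):
    """Stack 28 features + 1 flag + K probabilities + mean + 3-way one-hot bin.

    bin_edges are hypothetical; the source only says the bin is 3-way.
    """
    mean_p = probs.mean(axis=1, keepdims=True)
    bins = np.digitize(mean_p[:, 0], bin_edges)      # values in {0, 1, 2}
    one_hot = np.eye(3)[bins]
    return np.hstack([features, modality[:, None], probs, mean_p, one_hot])

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 28))
modality = np.array([0, 1, 0, 1])
probs = rng.uniform(size=(4, 5))
state = build_state(features, modality, probs)
print(state.shape)
```

With five base models the stacked state has $28+1+5+1+3=38$ columns, matching the dimension count stated above\.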

All reported numbers are mean $\pm$ standard deviation across 5 random seeds, controlling the sample draw, train/test split, and CV partition\. We report test accuracy and oracle agreement throughout; Appendix [H](https://arxiv.org/html/2606.04161#A8) provides training details and the bootstrap protocol used for paired intervals\. The main tables combine three uncertainty sources at once: resampling of examples, resplitting of train and test, and stochastic model training\.

A sample\-size sensitivity analysis on this main configuration, holding the test partition fixed and subsampling only the training buffer, is reported in [Table 2](https://arxiv.org/html/2606.04161#S6.T2)\. Code, experiment configurations, and aggregated CSV outputs underlying every table and figure are released at [https://anonymous\.4open\.science/r/gtedm\-icml26](https://anonymous.4open.science/r/gtedm-icml26)\. \(Footnote 2: Anonymized for double\-blind review\. The link will be replaced with the public repository at camera\-ready\.\)

## 6Results

We first establish the amount of oracle headroom available, then summarize the three\-stage diagnostic on the main configuration, and finally quantify how much buffer\-to\-test oracle shift remains after that diagnosis\. [Figure 3](https://arxiv.org/html/2606.04161#S6.F3) gives a compact visual summary of the main case; Appendix [B](https://arxiv.org/html/2606.04161#A2) reports the raw sample counts and action marginals behind it\.

### 6\.1Oracle headroom exists across window configurations

![Refer to caption](https://arxiv.org/html/2606.04161v1/x1.png)

Figure 2: Oracle gap across the 16 observation/prediction window configurations\. Values are absolute accuracy gains of the per\-instance oracle over the strongest single\-model reference on the same test split\. The highlighted $(14\text{d},14\text{d})$ configuration lies near the middle of the observed oracle\-gap range while preserving a plausible intervention horizon\. Full window\-level selector results are reported in [Table 6](https://arxiv.org/html/2606.04161#A4.T6)\.

Across the 16 window configurations, the oracle improves on the strongest single base model by 4\.5 to 15\.5 points, with a mean gap of 9\.7 points\. On the main $(14\text{d},14\text{d})$ configuration, the strongest single base model reaches $0.762 \pm 0.024$ and the oracle reaches $0.825$, leaving a 6\.3\-point gap for an adaptive selector to recover\.
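
The oracle gap itself is a simple quantity\. A minimal sketch, assuming a 0\.5 decision threshold on each base model's dropout probability and a per\-instance oracle that counts as correct whenever at least one base model is correct \(the name `oracle_gap` is illustrative\):

```python
import numpy as np

def oracle_gap(probs, y, threshold=0.5):
    """Return (oracle accuracy, best single-model accuracy, gap).

    probs: (n, K) dropout probabilities; y: (n,) binary labels.
    Assumption: predictions are thresholded at 0.5.
    """
    correct = (probs >= threshold).astype(int) == y[:, None]   # (n, K) booleans
    oracle_acc = correct.any(axis=1).mean()          # some model is right
    best_static_acc = correct.mean(axis=0).max()     # hindsight best single model
    return oracle_acc, best_static_acc, oracle_acc - best_static_acc

y = np.array([1, 0, 1, 0])
probs = np.array([[0.9, 0.3],
                  [0.1, 0.6],
                  [0.2, 0.7],
                  [0.8, 0.9]])
print(oracle_gap(probs, y))
```

On the main configuration the same subtraction gives $0.825 - 0.762 = 0.063$, the 6\.3\-point headroom quoted above\.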

[Table 6](https://arxiv.org/html/2606.04161#A4.T6) shows that some other windows offer larger oracle gaps, but they correspond either to very early, noisier prediction settings or to later windows where the downstream intervention is less actionable\. The main configuration sits near the middle of the headroom range while preserving a plausible educational intervention horizon\.

### 6\.2Oracle imitation improves, but not deployment accuracy

Table 1: Main $(14\text{d},14\text{d})$ results\. “Best static ref\.” is the strongest single base model on that held\-out test split, reported as a hindsight reference\. The first block uses the 800\-train/200\-test regime that supports the BC/DQN/CQL mechanism comparison\. The bottom block verifies the headline negative claim at $N{=}2{,}000$ test examples with an 8,000\-buffer training set \(full sweep in [Table 2](https://arxiv.org/html/2606.04161#S6.T2)\)\. DQN here uses the canonical $\{0,1\}$ oracle\-match reward; the original log\-probability shaping reward yields $0.158 \pm 0.025$ oracle agreement and $0.749 \pm 0.027$ test accuracy on the same seeds \(Appendix [C](https://arxiv.org/html/2606.04161#A3)\)\. CQL adds the standard conservative\-$Q$ penalty to the same one\-step bandit objective, swept across three penalty weights $\alpha$\. BC\-full minus best\-static\-reference accuracy is $-0.010$ \(95% CI $[-0.019,-0.001]$\) in the small\-$N$ regime and $0.000 \pm 0.001$ at $N{=}2{,}000$\. Probs\-only BC minus BC\-full is $-0.004$ \(CI $[-0.017,0.010]$\)\. Disagreement\-augmented BC minus BC\-full is $-0.005$ \(CI $[-0.014,0.004]$\)\.

Figure 3: Test accuracy by selector \([Table 1](https://arxiv.org/html/2606.04161#S6.T1); static reference dashed; accuracy axis from $0.70$ to $0.85$\)\. Bars shown: Oracle, Best static, BC \(full\), CQL \($\alpha{=}1$\), BC \(probs\), Held\-out 10\-NN, BC \(\+disagree\), CQL \($\alpha{=}5$\), DQN \(oracle\)\. Every learned selector clusters within $[0.74,0.77]$ near the static reference\. The oracle headroom of $0.063$ is not recovered\. Bars: $\pm 1$ std over five seeds\.

The local\-consistency diagnostic at $(14\text{d},14\text{d})$ is $0.388 \pm 0.010$\. A held\-out $10$\-NN selector in the same standardized state space reaches $0.486 \pm 0.041$ oracle agreement and $0.747 \pm 0.037$ test accuracy, which calibrates local recoverability from the current representation\. BC, DQN, and CQL cluster between $0.36$ and $0.51$ oracle agreement, above random \($1/K=0.20$\) but below the $10$\-NN ceiling\. CQL matches BC’s imitation \($0.51$ at $\alpha{=}1.0$\) while DQN reaches only $0.36$\. The gap between them isolates a missing pessimism term, since the supervisory signal is the same\. The log\-probability shaping reward originally used for DQN collapses oracle agreement below random because it is computed relative to LR; Appendix [C](https://arxiv.org/html/2606.04161#A3) shows test accuracy is roughly constant across reward variants \($0.743$ to $0.755$\)\.
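
One plausible reading of the local\-consistency diagnostic is the average fraction of a state's $k$ nearest buffer neighbors that share its oracle action\. The sketch below implements that reading; it is an assumption, since the formal definition lives in an earlier section, and `knn_label_consistency` is a hypothetical name\. Distances are Euclidean in the standardized state space\.

```python
import numpy as np

def knn_label_consistency(states, actions, k=10):
    """Mean fraction of each point's k nearest neighbors with the same oracle action.

    states: (n, d) standardized states; actions: (n,) integer oracle actions.
    Uses a dense (n, n) distance matrix, which is fine for buffers of ~800 rows.
    """
    diffs = states[:, None, :] - states[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]    # (n, k) neighbor indices
    return (actions[neighbors] == actions[:, None]).mean()

# Two well-separated clusters: consistency is 1.0 when labels follow the clusters.
states = np.array([[0.0], [0.1], [0.2], [0.3], [10.0], [10.1], [10.2], [10.3]])
clean = np.array([0, 0, 0, 0, 1, 1, 1, 1])
mixed = np.array([0, 1, 2, 3, 0, 1, 2, 3])
print(knn_label_consistency(states, clean, k=3),
      knn_label_consistency(states, mixed, k=3))
```

Under this reading, a value of $0.388$ means that a typical neighbor shares the oracle action less than four times in ten\.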

Yet none of the four learners exceeds the static reference of $0.762$ on held\-out accuracy\. Every method sits within $\pm 0.01$ of $0.748$ \([Figure 3](https://arxiv.org/html/2606.04161#S6.F3)\)\. The state ablations show the same pattern: probabilities\-only BC nearly matches the full state, and disagreement\-derived transforms improve oracle agreement slightly but leave test accuracy unchanged\. Of the paired intervals in [Table 1](https://arxiv.org/html/2606.04161#S6.T1), only BC\-full versus best static excludes zero, and in the wrong direction; both representation comparisons span zero\. Richer offline encodings of the same base\-model outputs do not measurably improve deployment accuracy here\. The 200\-example test split is not the cause: the bottom block of [Table 1](https://arxiv.org/html/2606.04161#S6.T1) shows that at $N{=}2{,}000$ held\-out examples and an $8{,}000$\-example training buffer, BC’s test accuracy \($0.753 \pm 0.001$\) is statistically indistinguishable from the hindsight static reference \($0.753 \pm 0.000$\) and DQN remains below both\. Full 16\-window results appear in [Table 6](https://arxiv.org/html/2606.04161#A4.T6)\.

### 6\.3The negative result is stable across windows and buffer sizes

[Table 6](https://arxiv.org/html/2606.04161#A4.T6) shows that the $(14\text{d},14\text{d})$ main case is not a one\-off failure\. Across all 16 observation/prediction windows, BC without disagreement augmentation trails the strongest single base model by 0\.7 to 3\.8 accuracy points, with a mean deficit of 1\.6 points\. The disagreement\-augmented selector also fails to close the gap: across the same windows it trails that reference by 0\.2 to 3\.2 points, with a mean deficit of 1\.9 points\.

The window sweep also clarifies what disagreement features change\. Augmentation improves oracle agreement in every window, by 0\.4 to 11\.9 points with a mean gain of 3\.0 points; the corresponding held\-out accuracy change averages $-0.003$ and never exceeds $+0.009$\. Deterministic transforms of the probability vector increase oracle agreement without producing a meaningful deployment benefit, which supports the same representational diagnosis\.

#### Sample\-size sweep

![Refer to caption](https://arxiv.org/html/2606.04161v1/x2.png)

Figure 4: Sample\-size sensitivity at $(14\text{d},14\text{d})$\. Training buffer varied over $\{800, 1{,}600, 4{,}000, 8{,}000\}$ with the $20\%$ test split held fixed across rows\. Bands: mean $\pm$ std over five seeds\.

Table 2: Sample\-size sensitivity numbers underlying [Figure 4](https://arxiv.org/html/2606.04161#S6.F4)\. “static\-hind\.” is the strongest single base model on the held\-out test split, picked with hindsight\. “static\-deploy\.” is the base model selected on training accuracy and evaluated on the same test split, a deployable selector\. The held\-out test partition contains $2{,}000$ examples and is fixed across rows\.

A natural reading of the main\-text gap of $-0.010$ between BC and the strongest single base model is that 200 test examples is too few to distinguish the two\. To test that explanation, we hold the test partition fixed at $2{,}000$ examples and vary only the training\-buffer size\. This addresses both the sample\-size concern and the test\-stability concern flagged by the diagnostic protocol’s own design\. [Table 2](https://arxiv.org/html/2606.04161#S6.T2) reports the result\.

The negative deployment claim survives at every buffer size\. BC test accuracy rises from $0.740 \pm 0.004$ at 800 training examples to $0.753 \pm 0.001$ at $8{,}000$, but so does the hindsight static reference, and the gap between them collapses to a mean difference of roughly zero \($+0.000 \pm 0.001$ at $8{,}000$\) without crossing into a positive gain\. Against the deployable static reference \(the base model picked from training\-set accuracy\), BC actually *leads* by 0\.6 to 0\.9 points across all sizes\. We treat that lead as suggestive because it is small and depends on a deployment story whose details \(training\-set picking rule, intervention budget, evaluation metric\) are specific to this task\.

The sweep also clarifies the diagnostic story underneath that result\. Local $k$\-NN consistency is roughly flat in the buffer size \($0.381 \to 0.356$ as the buffer grows by an order of magnitude\), consistent with the local\-ambiguity diagnosis not being a small\-sample artifact\. Buffer\-to\-test oracle shift, by contrast, decreases monotonically \($d_{\mathrm{TV}} = 0.046 \to 0.028$\), consistent with a buffer that better approximates the test\-time marginal as it grows\. Neither change moves deployment accuracy materially, which mirrors the single\-cache analysis in [Section 6\.5](https://arxiv.org/html/2606.04161#S6.SS5)\.

### 6\.4Hard\-label tie\-breaking does not drive the negative result

The hard\-argmax oracle treats samples on which multiple base models are simultaneously correct as if the choice between them mattered\. The tie rate \(the buffer fraction with $\geq 2$ correct base models\) is $0.778 \pm 0.008$\. We therefore also report mean per\-sample regret: the fraction of test samples on which the policy selects an incorrect model when at least one correct model exists \(the oracle attains $0$\)\.
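
Both quantities follow directly from a boolean correctness matrix\. A minimal sketch of the two definitions above, with hypothetical function names and a toy four\-sample example:

```python
import numpy as np

def tie_rate(correct):
    """Fraction of samples on which at least two base models are correct.

    correct: (n, K) boolean matrix, correct[i, j] = model j is right on sample i.
    """
    return (correct.sum(axis=1) >= 2).mean()

def mean_regret(correct, chosen):
    """Fraction of samples where the policy picks a wrong model although a
    correct one exists. The denominator is all samples, as in the text."""
    n = correct.shape[0]
    picked_correct = correct[np.arange(n), chosen]
    recoverable = correct.any(axis=1)
    return (recoverable & ~picked_correct).mean()

correct = np.array([[True,  True],
                    [True,  False],
                    [False, False],
                    [False, True]])
chosen = np.array([0, 1, 0, 0])        # policy's selected model per sample
print(tie_rate(correct), mean_regret(correct, chosen))
```

A policy that always picks a correct model whenever one exists has regret $0$ regardless of which of the tied models it chooses, which is why regret is insensitive to hard\-label tie\-breaking\.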

Two things follow\. First, oracle agreement spans $0.36$ to $0.52$ while regret clusters tightly in $[0.089,0.101]$, so hard\-label agreement overstates how much BC, DQN, and CQL differ on a deployment\-relevant metric\. Second, the strongest base model still leads in regret \($0.059$\), so the negative result is not a tie\-breaking artifact\. The high tie rate does qualify Stage 1: with $0.778$ of buffer samples having $\geq 2$ correct base models, the $k$\-NN consistency of $0.388$ partly reflects arg\-max noise among near\-tied probabilities, and is best read as an upper bound on the signal recoverable by any hard\-label local rule\.

### 6\.5Buffer\-to\-test oracle shift is locally non\-trivial

![Refer to caption](https://arxiv.org/html/2606.04161v1/x3.png)

Figure 5: Buffer\-to\-test marginal $d_{\mathrm{TV}}(\pi_{\beta},\pi^{\star})$ across the 16 windows\. All values are positive but modest\. Main configuration: $0.070 \pm 0.015$\. Full table and controlled\-classifier analysis in [Table 7](https://arxiv.org/html/2606.04161#A6.T7)\.

[Figure 5](https://arxiv.org/html/2606.04161#S6.F5) shows that the oracle\-label distribution induced by the offline buffer is not identical to the test\-time oracle\. Across all 16 window configurations, $d_{\mathrm{TV}}(\pi_{\beta},\pi^{\star})$ ranges from $0.048$ to $0.083$, with a mean of $0.063 \pm 0.011$\. On the main configuration it is $0.070 \pm 0.015$\.
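
The marginal shift is the total\-variation distance between two action histograms\. A small sketch, assuming oracle actions are encoded as integers in $\{0,\ldots,K-1\}$ \(the helper names are illustrative\):

```python
import numpy as np

def action_marginal(actions, n_actions=5):
    """Empirical distribution of integer-coded oracle actions."""
    return np.bincount(actions, minlength=n_actions) / len(actions)

def tv_distance(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

buffer_actions = np.array([0, 0, 1, 1])
test_actions = np.array([0, 1, 1, 1])
p_buffer = action_marginal(buffer_actions, n_actions=2)
p_test = action_marginal(test_actions, n_actions=2)
print(tv_distance(p_buffer, p_test))
```

Because this compares histograms only, two sets of labels with identical marginals but different per\-state assignments would score zero\.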

The controlled classifier analysis summarized in [Table 7](https://arxiv.org/html/2606.04161#A6.T7) points the same way\. On $(14\text{d},14\text{d})$, a classifier trained to imitate the buffer oracle is closer to the buffer than to the test oracle by $+0.016 \pm 0.040$, indicating that the shift points in the direction that would hurt deployment, although the seed\-level spread is larger than the mean\. A separate robustness check using external stratifiers is reported in Appendix [G](https://arxiv.org/html/2606.04161#A7)\.

The marginal $d_{\mathrm{TV}}$ in [Equation 2](https://arxiv.org/html/2606.04161#S3.E2) averages over the state, so a small marginal can hide substantial pointwise shift\. We therefore also estimate the conditional shift $\mathbb{E}_{s}[d_{\mathrm{TV}}(\pi_{\beta}(\cdot\mid s),\pi^{\star}(\cdot\mid s))]$ by approximating each per\-state distribution from its $k{=}10$ nearest neighbors \(buffer\-side for $\pi_{\beta}$, test\-side for $\pi^{\star}$\)\. On the main configuration the conditional $d_{\mathrm{TV}}$ averages $0.288 \pm 0.024$, roughly four times the marginal, with $95$th percentile $0.521 \pm 0.040$ and worst\-state $0.860 \pm 0.049$\. Pointwise shift is therefore much larger than the marginal suggests, with a small tail of states approaching full disagreement\.
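
The nearest\-neighbor estimate of the conditional shift can be sketched as below\. Several details are assumptions: the expectation is taken over test states, distances are Euclidean, a test state counts among its own test\-side neighbors, and each side holds at least $k$ points\. The names `local_marginal` and `conditional_tv` are hypothetical\.

```python
import numpy as np

def local_marginal(query, states, actions, k, n_actions):
    """Action histogram over the k states nearest to `query` (Euclidean)."""
    dist = np.linalg.norm(states - query, axis=1)
    nearest = np.argsort(dist)[:k]
    return np.bincount(actions[nearest], minlength=n_actions) / k

def conditional_tv(test_states, test_actions, buf_states, buf_actions,
                   k=10, n_actions=5):
    """Return (mean, 95th percentile, max) of per-state TV distances.

    Assumption: states are queried at test points; each side has >= k rows.
    """
    tvs = []
    for s in test_states:
        p_buf = local_marginal(s, buf_states, buf_actions, k, n_actions)
        p_test = local_marginal(s, test_states, test_actions, k, n_actions)
        tvs.append(0.5 * np.abs(p_buf - p_test).sum())
    tvs = np.array(tvs)
    return tvs.mean(), np.percentile(tvs, 95), tvs.max()

grid = np.arange(6.0)[:, None]                 # same six states on both sides
zeros, ones = np.zeros(6, dtype=int), np.ones(6, dtype=int)
print(conditional_tv(grid, ones, grid, zeros, k=3, n_actions=2))
```

The three returned summaries correspond to the mean, $95$th\-percentile, and worst\-state figures reported above\.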

If local shift were the dominant failure mode, CQL’s conservatism term \(which explicitly tightens toward $\pi_{\beta}$\) should hurt deployment more than DQN’s unregularized objective\. It does not\. BC, DQN, and CQL differ sharply in oracle imitation \($0.36$ to $0.51$\) but converge on test accuracy \($0.743$ to $0.753$, within $0.01$ across all three CQL penalty weights\)\. Local support mismatch is substantial and works against deployment, but it is not the main driver of the result\. The offline state simply does not identify a deployment\-winning action reliably enough, and that limitation is shared across learners regardless of how aggressively they regularize toward the buffer\.

## 7Discussion: An Offline Diagnostic Procedure

The combined diagnostic procedure points to a locally ambiguous state representation as the primary failure mode\. Held\-out accuracy is nearly unmoved by disagreement\-derived feature transforms, by adding the canonical offline\-RL conservatism term, or by enlarging the buffer tenfold\. Buffer\-to\-test oracle shift moves it only marginally, even when the shift is measured per\-state\.

### 7\.1Why offline\-RL constraints may not close this gap

Conservative offline\-RL methods \(CQL \(Kumar et al\., [2020](https://arxiv.org/html/2606.04161#bib.bib8)\), IQL \(Kostrikov et al\., [2021](https://arxiv.org/html/2606.04161#bib.bib10)\), BRAC \(Wu et al\., [2019](https://arxiv.org/html/2606.04161#bib.bib13)\), and BEAR \(Kumar et al\., [2019](https://arxiv.org/html/2606.04161#bib.bib12)\)\) target $Q$\-value overestimation at unseen $(s,a)$ under bootstrap\. This task exhibits a different failure mode\. Setting $\gamma{=}0$ eliminates bootstrap by construction, and the ambiguity is in the $(s,a^{\star})$ labelling itself\. Similar states disagree on the oracle action $61.2\%$ of the time\. Conservatism reweights the decision rule under the existing labels; the labels themselves are unchanged\. It therefore cannot recover signal that the $(s,a^{\star})$ pairs do not already contain\.

[Table 1](https://arxiv.org/html/2606.04161#S6.T1) bears this out\. Adding the CQL penalty to the same one\-step objective recovers oracle agreement of $0.51$ at $\alpha{=}1.0$, comparable to BC and well above DQN’s $0.36$\. Across $\alpha\in\{0.1,1.0,5.0\}$, test accuracy moves by less than a point and never reaches the static reference of $0.762$\. Conservatism closes the imitation gap without moving deployment, even though the conditional shift $\mathbb{E}_{s}[d_{\mathrm{TV}}]\approx 0.29$ \([Figure 5](https://arxiv.org/html/2606.04161#S6.F5)\) is substantial: the gap to the static reference appears to be limited by the representation rather than by distance to $\pi_{\beta}$\.
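
For concreteness, a one\-step \($\gamma{=}0$\) objective with a conservative penalty can be written as below\. This sketch uses the common log\-sum\-exp form of the CQL regularizer; whether the paper's implementation matches it term for term is an assumption, and `one_step_cql_loss` is a hypothetical name\. With $\gamma{=}0$ the regression target is just the logged reward, so no bootstrapped value appears\.

```python
import numpy as np

def one_step_cql_loss(q_values, actions, rewards, alpha=1.0):
    """Squared one-step error plus alpha * (logsumexp_a Q(s, a) - Q(s, a_logged)).

    q_values: (n, K) predicted Q; actions: (n,) logged actions; rewards: (n,).
    Assumption: the CQL(H)-style penalty; alpha = 0 recovers the plain objective.
    """
    n = q_values.shape[0]
    q_taken = q_values[np.arange(n), actions]
    fit = (q_taken - rewards) ** 2                       # gamma = 0: target is r
    q_max = q_values.max(axis=1, keepdims=True)          # stabilise the exponent
    logsumexp = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))
    penalty = logsumexp - q_taken                        # pushes unlogged Q down
    return (fit + alpha * penalty).mean()

q = np.zeros((1, 5))
print(one_step_cql_loss(q, np.array([0]), np.array([0.0]), alpha=1.0))
```

The penalty only reshapes $Q$ around the logged $(s,a^{\star})$ pairs; it adds no information about which action wins in states where neighbors disagree\.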

### 7\.2Using the diagnostics and benchmark implications

As a pre\-deployment checklist, the procedure first asks whether oracle headroom is large enough to justify any adaptive selector\. If headroom exists, it asks whether a simple local predictor can recover oracle actions from the current state\. When both $k$\-NN consistency and held\-out $10$\-NN accuracy are low, additional offline\-RL tuning is unlikely to pay off; the next iteration should focus on state design, relabeling, or data collection\. Heavier offline learners or offline\-to\-online adaptation \(Mandel et al\., [2014](https://arxiv.org/html/2606.04161#bib.bib20)\) are worth the cost only when local recoverability is already meaningful\. The same four quantities \(oracle gap, $k$\-NN consistency, a held\-out local selector, and a buffer\-to\-test oracle\-shift measure\) make a negative offline\-selector result interpretable in benchmark settings\.
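
The checklist reads naturally as a short decision rule\. The sketch below is only an illustration: the text gives no numeric cut\-offs, so `min_gap` and `min_consistency` are required arguments that a practitioner would have to choose, and treating "low $10$\-NN accuracy" as "does not exceed the static reference" is an assumption\.

```python
def recommend_next_step(oracle_gap, knn_consistency, knn_accuracy,
                        static_accuracy, min_gap, min_consistency):
    """Map the diagnostics to a suggested next iteration.

    min_gap and min_consistency are user-chosen thresholds (hypothetical;
    the source does not specify values).
    """
    if oracle_gap < min_gap:
        return "keep the single best model"
    locally_recoverable = (knn_consistency >= min_consistency
                           or knn_accuracy > static_accuracy)
    if not locally_recoverable:
        return "redesign the state, relabel, or collect new data"
    return "tune the offline learner"

# Main-configuration numbers from the text, with illustrative thresholds.
print(recommend_next_step(oracle_gap=0.063, knn_consistency=0.388,
                          knn_accuracy=0.747, static_accuracy=0.762,
                          min_gap=0.02, min_consistency=0.5))
```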

### 7\.3Fairness and intervention\-cost considerations

The inclusion criterion \($\geq 10$ events, $\geq 3$ active days, $\geq 28$\-day activity span\) excludes roughly $85\%$ of enrolled learners, mostly casual browsers, so our claims are conditioned on the moderately engaged subpopulation\. A learned selector also triggers downstream human intervention, so selection errors that systematically favor one base model in one subpopulation may translate into inequitable intervention coverage\. An offline meta\-learner whose failure modes are unmapped across demographic strata is not yet suitable for intervention\-triggering use\.

## 8Limitations and Conclusion

All experiments draw from one introductory computer science curriculum at one institution; the diagnostics should be recomputed on any new dataset\. The inclusion criteria bias the sample toward moderately engaged learners, and the contextual\-bandit reduction \($\gamma{=}0$\) cannot model within\-course state transitions\. The offline setting also precludes the online exploration our diagnosis motivates, so we can diagnose offline limits but cannot measure what online adaptation would achieve here\.

On this task, the diagnostic procedure points to local label ambiguity as the primary offline bottleneck\. Adding conservatism does not close the gap in our experiments, and accounting for buffer\-to\-test shift does not either\. The combined procedure is a lightweight pre\-deployment checklist for deciding whether the offline state is informative enough to support model selection: when it flags low consistency and shared BC/DQN/CQL failure, the next iteration is more likely to pay off by changing the data than by changing the learner\.

## References

- Á\. Casado Hidalgo, P\. Moreno\-Ger, and L\. de la Fuente\-Valentín \(2022\) Using meta\-learning to predict student performance in virtual learning environments\. Applied Intelligence 52\(3\), pp\. 3352–3365\. External Links: [Document](https://dx.doi.org/10.1007/s10489-021-02613-x), [Link](https://doi.org/10.1007/s10489-021-02613-x)\. Cited by: [§1](https://arxiv.org/html/2606.04161#S1.p1.1), [§2](https://arxiv.org/html/2606.04161#S2.p2.1)\.
- R\. M\. O\. Cruz, R\. Sabourin, G\. D\. C\. Cavalcanti, and T\. I\. Ren \(2015\) META\-DES: A dynamic ensemble selection framework using meta\-learning\. Pattern Recognition 48\(5\), pp\. 1925–1935\. External Links: [Document](https://dx.doi.org/10.1016/j.patcog.2014.12.003), [Link](https://doi.org/10.1016/j.patcog.2014.12.003)\. Cited by: [§1](https://arxiv.org/html/2606.04161#S1.p1.1), [§2](https://arxiv.org/html/2606.04161#S2.p2.1)\.
- R\. M\. O\. Cruz, R\. Sabourin, and G\. D\. C\. Cavalcanti \(2018\) Dynamic classifier selection: recent advances and perspectives\. Information Fusion 41, pp\. 195–216\. External Links: [Document](https://dx.doi.org/10.1016/j.inffus.2017.09.010), [Link](https://doi.org/10.1016/j.inffus.2017.09.010)\. Cited by: [§1](https://arxiv.org/html/2606.04161#S1.p1.1), [§1](https://arxiv.org/html/2606.04161#S1.p3.1), [§2](https://arxiv.org/html/2606.04161#S2.p2.1)\.
- F\. Dalipi, A\. S\. Imran, and Z\. Kastrati \(2018\) MOOC dropout prediction using machine learning techniques: Review and research challenges\. In 2018 IEEE Global Engineering Education Conference \(EDUCON\), pp\. 1007–1014\. External Links: [Document](https://dx.doi.org/10.1109/EDUCON.2018.8363340), [Link](https://ieeexplore.ieee.org/document/8363340)\. Cited by: [§2](https://arxiv.org/html/2606.04161#S2.p2.1)\.
- S\. Dass, K\. Gary, and J\. Cunningham \(2021\) Predicting student dropout in self\-paced MOOC course using random forest model\. Information 12\(11\), pp\. 476\. External Links: [Document](https://dx.doi.org/10.3390/info12110476), [Link](https://doi.org/10.3390/info12110476)\. Cited by: [§2](https://arxiv.org/html/2606.04161#S2.p2.1)\.
- M\. Dudík, J\. Langford, and L\. Li \(2011\) Doubly robust policy evaluation and learning\. In Proceedings of the 28th International Conference on Machine Learning \(ICML\), pp\. 1097–1104\. External Links: [Link](https://icml.cc/2011/papers/554_icmlpaper.pdf)\. Cited by: [§2](https://arxiv.org/html/2606.04161#S2.p2.1), [§3](https://arxiv.org/html/2606.04161#S3.SS0.SSS0.Px6.p1.2)\.
- S\. Fujimoto, D\. Meger, and D\. Precup \(2019\) Off\-policy deep reinforcement learning without exploration\. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 97, pp\. 2052–2062\. External Links: [Link](https://proceedings.mlr.press/v97/fujimoto19a.html)\. Cited by: [§2](https://arxiv.org/html/2606.04161#S2.p1.1)\.
- O\. Goren, L\. Cohen, and A\. Rubinstein \(2024\) Early prediction of student dropout in higher education using machine learning models\. In Proceedings of the 17th International Conference on Educational Data Mining \(EDM\), pp\. 349–359\. External Links: [Link](https://educationaldatamining.org/edm2024/proceedings/2024.EDM-short-papers.32/index.html)\. Cited by: [§2](https://arxiv.org/html/2606.04161#S2.p2.1)\.
- A\. H\. Ko, R\. Sabourin, and A\. d\. S\. Britto \(2008\) From dynamic classifier selection to dynamic ensemble selection\. Pattern Recognition 41\(5\), pp\. 1718–1731\. External Links: [Document](https://dx.doi.org/10.1016/j.patcog.2007.10.015), [Link](https://doi.org/10.1016/j.patcog.2007.10.015)\. Cited by: [§1](https://arxiv.org/html/2606.04161#S1.p3.1), [§2](https://arxiv.org/html/2606.04161#S2.p2.1)\.
- I\. Kostrikov, A\. Nair, and S\. Levine \(2021\) Offline reinforcement learning with implicit Q\-learning\. arXiv preprint arXiv:2110\.06169\. External Links: [Link](https://arxiv.org/abs/2110.06169)\. Cited by: [§2](https://arxiv.org/html/2606.04161#S2.p1.1), [§7\.1](https://arxiv.org/html/2606.04161#S7.SS1.p1.6)\.
- A\. Kumar, J\. Fu, M\. Soh, G\. Tucker, and S\. Levine \(2019\) Stabilizing off\-policy Q\-learning via bootstrapping error reduction\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 32, pp\. 11761–11771\. External Links: [Link](https://papers.neurips.cc/paper/2019/hash/c2073ffa77b5357a498057413bb09d3a-Abstract.html)\. Cited by: [§2](https://arxiv.org/html/2606.04161#S2.p1.1), [§7\.1](https://arxiv.org/html/2606.04161#S7.SS1.p1.6)\.
- A\. Kumar, A\. Zhou, G\. Tucker, and S\. Levine \(2020\) Conservative Q\-learning for offline reinforcement learning\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 33, pp\. 1179–1191\. External Links: [Link](https://papers.nips.cc/paper/2020/hash/0d2b2061826a5df3221116a5085a6052-Abstract.html)\. Cited by: [§1](https://arxiv.org/html/2606.04161#S1.p1.1), [§1](https://arxiv.org/html/2606.04161#S1.p3.1), [§2](https://arxiv.org/html/2606.04161#S2.p1.1), [§2](https://arxiv.org/html/2606.04161#S2.p2.1), [§4\.2](https://arxiv.org/html/2606.04161#S4.SS2.p4.7), [§7\.1](https://arxiv.org/html/2606.04161#S7.SS1.p1.6)\.
- A\. S\. Lan and R\. G\. Baraniuk \(2016\) A contextual bandits framework for personalized learning action selection\. In Proceedings of the 9th International Conference on Educational Data Mining \(EDM\), pp\. 424–429\. External Links: [Link](https://educationaldatamining.org/EDM2016/proceedings/paper_63.pdf)\. Cited by: [§2](https://arxiv.org/html/2606.04161#S2.p1.1), [§3](https://arxiv.org/html/2606.04161#S3.SS0.SSS0.Px3.p1.3)\.
- S\. Levine, A\. Kumar, G\. Tucker, and J\. Fu \(2020\) Offline reinforcement learning: tutorial, review, and perspectives on open problems\. arXiv preprint arXiv:2005\.01643\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2005.01643), [Link](https://arxiv.org/abs/2005.01643)\. Cited by: [§1](https://arxiv.org/html/2606.04161#S1.p1.1), [§2](https://arxiv.org/html/2606.04161#S2.p1.1)\.
- Y\. Liu, S\. Fan, S\. Xu, A\. Sajjanhar, S\. Yeom, and Y\. Wei \(2023\) Predicting student performance using clickstream data and machine learning\. Education Sciences 13\(1\), pp\. 17\. External Links: [Document](https://dx.doi.org/10.3390/educsci13010017), [Link](https://doi.org/10.3390/educsci13010017)\. Cited by: [§1](https://arxiv.org/html/2606.04161#S1.p1.1)\.
- I\. Lykourentzou, I\. Giannoukos, V\. Nikolopoulos, G\. Mpardis, and V\. Loumos \(2009\) Dropout prediction in e\-learning courses through the combination of machine learning techniques\. Computers & Education 53\(3\), pp\. 950–965\. External Links: [Document](https://dx.doi.org/10.1016/j.compedu.2009.05.010), [Link](https://doi.org/10.1016/j.compedu.2009.05.010)\. Cited by: [§2](https://arxiv.org/html/2606.04161#S2.p2.1)\.
- T\. Mandel, Y\. Liu, S\. Levine, E\. Brunskill, and Z\. Popović \(2014\) Offline policy evaluation across representations with applications to educational games\. In Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems \(AAMAS\), pp\. 1077–1084\. External Links: [Link](https://www.ifaamas.org/Proceedings/aamas2014/aamas/p1077.pdf)\. Cited by: [§2](https://arxiv.org/html/2606.04161#S2.p1.1), [§7\.2](https://arxiv.org/html/2606.04161#S7.SS2.p1.3)\.
- V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. Graves, I\. Antonoglou, D\. Wierstra, and M\. Riedmiller \(2013\) Playing Atari with deep reinforcement learning\. arXiv preprint arXiv:1312\.5602\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.1312.5602), [Link](https://arxiv.org/abs/1312.5602)\. Cited by: [§1](https://arxiv.org/html/2606.04161#S1.p2.2), [§4\.2](https://arxiv.org/html/2606.04161#S4.SS2.p3.7)\.
- J\. R\. Rice \(1976\) The algorithm selection problem\. Advances in Computers 15, pp\. 65–118\. External Links: [Document](https://dx.doi.org/10.1016/S0065-2458%2808%2960520-3), [Link](https://doi.org/10.1016/S0065-2458(08)60520-3)\. Cited by: [§1](https://arxiv.org/html/2606.04161#S1.p1.1), [§2](https://arxiv.org/html/2606.04161#S2.p2.1)\.
- A\. Swaminathan and T\. Joachims \(2015\) Counterfactual risk minimization: learning from logged bandit feedback\. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 37, pp\. 814–823\. External Links: [Link](https://proceedings.mlr.press/v37/swaminathan15.html)\. Cited by: [§2](https://arxiv.org/html/2606.04161#S2.p2.1), [§3](https://arxiv.org/html/2606.04161#S3.SS0.SSS0.Px6.p1.2)\.
- C\. Taylor, K\. Veeramachaneni, and U\. O’Reilly \(2014\) Likely to stop? Predicting stopout in massive open online courses\. arXiv preprint arXiv:1408\.3382\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.1408.3382), [Link](https://arxiv.org/abs/1408.3382)\. Cited by: [§1](https://arxiv.org/html/2606.04161#S1.p1.1), [§2](https://arxiv.org/html/2606.04161#S2.p2.1)\.
- D\. H\. Wolpert \(1992\) Stacked generalization\. Neural Networks 5\(2\), pp\. 241–259\. External Links: [Document](https://dx.doi.org/10.1016/S0893-6080%2805%2980023-1), [Link](https://doi.org/10.1016/S0893-6080(05)80023-1)\. Cited by: [§5\.4](https://arxiv.org/html/2606.04161#S5.SS4.p1.2)\.
- Y\. Wu, G\. Tucker, and O\. Nachum \(2019\) Behavior regularized offline reinforcement learning\. arXiv preprint arXiv:1911\.11361\. External Links: [Link](https://arxiv.org/abs/1911.11361)\. Cited by: [§2](https://arxiv.org/html/2606.04161#S2.p1.1), [§7\.1](https://arxiv.org/html/2606.04161#S7.SS1.p1.6)\.
- W\. Xing, X\. Chen, J\. Stein, and M\. Marcinkowski \(2016\) Temporal predication of dropouts in MOOCs: reaching the low hanging fruit through stacking generalization\. Computers in Human Behavior 58, pp\. 119–129\. External Links: [Document](https://dx.doi.org/10.1016/j.chb.2015.12.007), [Link](https://doi.org/10.1016/j.chb.2015.12.007)\. Cited by: [§2](https://arxiv.org/html/2606.04161#S2.p2.1)\.

## Appendix AFeature and Setup Details

The filtered edX corpus contains $34{,}303$ qualifying student\-course pairs from $21{,}990$ students after requiring at least 10 events, at least 3 active days, and at least 28 days of activity span\. These filters bias the analysis toward moderately engaged learners, but they ensure that every sample supports the observation/prediction windows evaluated in the main paper\.

From each observation window we derive 28 engineered behavioral features in six categories\. These complement the modality flag and base\-model probability vector used by the selector\. [Table 3](https://arxiv.org/html/2606.04161#A1.T3) summarizes the taxonomy used throughout the paper\.

Table 3: Feature categories with representative examples\. All 28 features enter the selector state as standardized scalars\.

## Appendix BMain\-Case Diagnostics and Raw Counts

The body’s [Figure 3](https://arxiv.org/html/2606.04161#S6.F3) compresses the main $(14\text{d},14\text{d})$ test\-accuracy comparison into one view\. BC, DQN, CQL, and held\-out $10$\-NN all remain near the static reference of $0.762$ while the per\-instance oracle reaches $0.825$\. Each seed evaluates exactly 200 held\-out examples, so the $0.010$ accuracy gap between BC\-full and the strongest single base model corresponds to roughly two test examples per seed\. That scale is easy to lose in the mean\-based summaries reported in the body\. The corresponding buffer/test counts and main\-case action marginals appear in [Table 4](https://arxiv.org/html/2606.04161#A2.T4)\.

Table 4: Main $(14\text{d},14\text{d})$ action marginals averaged over 5 seeds. Each seed uses an 800-example CV buffer and a 200-example held-out test split. The same regime yields $k$-NN consistency $0.388\pm 0.010$, BC buffer-oracle agreement $0.541\pm 0.012$, BC test-oracle agreement $0.508\pm 0.041$, $d_{\mathrm{TV}}(\pi_{\beta},\pi^{\star})=0.070\pm 0.015$, and a classifier-to-buffer minus classifier-to-test distance of $+0.016\pm 0.040$.

## Appendix C DQN Action-Marginal Diagnosis

The body [Table 1](https://arxiv.org/html/2606.04161#S6.T1) reports DQN under the canonical $\{0,1\}$ oracle-match reward, which attains $0.356\pm 0.038$ oracle agreement on the $(14\text{d},14\text{d})$ configuration. The log-probability shaping reward used in earlier drafts attains only $0.158\pm 0.025$ oracle agreement on the same seeds, below the $1/K=0.20$ uniform-random baseline on $K{=}5$ actions. This appendix explains why, by holding the buffer, base models, CV partition, and DQN architecture fixed and sweeping only the reward shaping function defined in [Section 4](https://arxiv.org/html/2606.04161#S4). Five reward variants are evaluated on the same $(14\text{d},14\text{d})$ buffer with five seeds: $\mathrm{logprob}$ (the original variant), $\mathrm{clipped}$ correctness, $\mathrm{oracle\_match}$ (binary correctness on the chosen base model, used in the body), $\mathrm{margin}$, and a $\mathrm{hybrid}$ correctness-plus-margin combination.

Table 5: DQN action-marginal sweep on $(14\text{d},14\text{d})$, five seeds. “oracle agree” is the fraction of test samples on which the DQN’s chosen action equals the test oracle’s. “top share” is the largest DQN selected-action marginal. The five right-most columns are the DQN’s selected-action marginal on the test set, one column per base model. The “oracle (ref)” row reports the test-set oracle’s marginal as a reference target.

Two patterns explain the headline number. First, the $\mathrm{logprob}$ reward used in earlier drafts is computed relative to the LR baseline, $R_{\mathrm{log}}(s,a)=\log p_{a}^{(y)}(s)-\log p_{\mathrm{LR}}^{(y)}(s)$. On samples where LR is itself the oracle action, the chosen and baseline log-probabilities cancel and the buffer reward for that sample is exactly zero. On samples where another model dominates LR, the reward is positive. The DQN therefore observes that selecting LR is never net-positive in the buffer, and across all $1{,}000$ test predictions in the five-seed sweep it *never selects LR*, despite the test oracle picking LR 270 times. The LR-marginal cell in [Table 5](https://arxiv.org/html/2606.04161#A3.T5) for the $\mathrm{logprob}$ row is exactly $0.000$. The $0.158$ logprob oracle agreement is dominated by these unrecoverable LR samples.

Second, the failure is reward-specific. The $\mathrm{oracle\_match}$ reward (binary correctness, no LR-relative shaping) recovers an LR marginal of $0.232$, close to the oracle’s $0.270$, and lifts oracle agreement from $0.158\pm 0.026$ to $0.356\pm 0.037$. The $\mathrm{hybrid}$ variant reaches $0.349\pm 0.052$. Importantly, deployment-relevant test accuracy is roughly constant across reward variants. The five rows span $0.743$ to $0.755$, all within their mutual confidence intervals and all below the strongest single base model reference of $0.762\pm 0.024$. The choice of reward shaping rearranges which samples the DQN gets right at the action level without materially shifting the deployment metric.
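The cancellation in the LR-relative reward can be checked numerically. The sketch below is illustrative only: the model names `A` and `B` and all probabilities are hypothetical, and `logprob_reward` is a name introduced here, not the paper's code. It evaluates $R_{\mathrm{log}}$ for every action on one sample where LR is the oracle and one where another model dominates.

```python
import math


def logprob_reward(p_true, action, baseline="LR"):
    """LR-relative shaping: log p_a(y) - log p_LR(y) for the chosen action."""
    return math.log(p_true[action]) - math.log(p_true[baseline])


# Hypothetical probabilities that each base model assigns to the true label.
lr_is_oracle = {"LR": 0.90, "A": 0.60, "B": 0.55}
other_is_oracle = {"LR": 0.40, "A": 0.85, "B": 0.30}

rewards_lr_oracle = {a: logprob_reward(lr_is_oracle, a) for a in lr_is_oracle}
rewards_other_oracle = {a: logprob_reward(other_is_oracle, a) for a in other_is_oracle}

# Choosing LR yields exactly zero in both cases; only non-LR actions can be positive.
print(rewards_lr_oracle)
print(rewards_other_oracle)
```

In this toy case the reward for choosing LR is zero whether or not LR is the best model, so a positive reward is only ever attached to non-LR actions, which is the asymmetry the paragraph above describes.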

This decomposition tightens the Stage-2 reading in [Section 6.2](https://arxiv.org/html/2606.04161#S6.SS2). The original logic (“BC and DQN fail by similar margins, indicating a shared bottleneck above both algorithms”) was strained when oracle agreements differed by $0.508$ (BC) vs. $0.158$ (logprob DQN). With the reward sweep, the shared-failure claim is supported on the test-accuracy axis, where every reward and every comparator lands near $0.75$, below the $0.762$ static reference. It is qualified on the oracle-agreement axis, where reward shaping accounts for most of the original BC–DQN gap, and the body’s $\mathrm{oracle\_match}$ row at $0.356\pm 0.038$ already lifts the comparison above random. The representational diagnosis still holds, and the algorithmic-versus-representational separation is now sharper because the reward-shaping confound is identified.

## Appendix D Full 16-Window Aggregated Results

[Table 6](https://arxiv.org/html/2606.04161#A4.T6) gives the full 16-window summary underlying the body claims about local consistency, behavioral cloning, and disagreement augmentation.

Table 6: Full 16-window summary of local $k$-NN consistency, BC oracle agreement, BC held-out accuracy, and the best single-model reference, with and without disagreement augmentation.

## Appendix E $k$-NN Representation Robustness

[Section 4.1](https://arxiv.org/html/2606.04161#S4.SS1) reports $k$-NN consistency under standardized Euclidean distance on the full $d{=}38$ state. With $n_{\text{buffer}}=800$ and $k{=}10$, that geometry sits near the regime where neighborhood relationships can become metric-fragile. To check whether the local-ambiguity diagnosis depends on the metric or dimensionality, we recompute $k$-NN consistency $\bar{c}$ and the held-out $10$-NN selector under three alternative representations of the same buffer state: the $5$-dimensional base-model probability subvector alone (probs-only), a PCA projection to ten components fit on the buffer (pca10), and Mahalanobis distance using the buffer covariance with a small ridge (mahalanobis). All projections fit on the buffer only and are applied to held-out points without leakage.

Across all four representations, $\bar{c}$ stays in $[0.38,0.43]$ and the held-out $10$-NN test accuracy stays in $[0.74,0.76]$. The probability-subspace and Mahalanobis variants raise consistency by roughly five points relative to the full Euclidean baseline, but no representation breaks above $0.5$ on either consistency or oracle agreement, and none beats the static reference of $0.762$ on test accuracy. The Stage-1 diagnosis (that nearby states frequently disagree on the oracle action) is therefore not a metric artifact. It is robust to dimensionality reduction, subspace selection, and metric choice. These numbers are computed on the augmented $d{=}51$ state used in [Section 4.3](https://arxiv.org/html/2606.04161#S4.SS3). On the unaugmented $d{=}38$ state, the full variant matches the body’s headline $0.388\pm 0.010$.
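As a rough illustration of the quantity being recomputed, the sketch below assumes $\bar{c}$ is the mean fraction of a point's $k$ nearest buffer neighbors that share its oracle action; the exact definition is given in Section 4.1 and may differ. The functions `standardize` and `knn_consistency` are names introduced here, and the data are synthetic. Swapping in probs-only, PCA, or Mahalanobis representations would amount to transforming `states` before the call, which is not implemented here.

```python
import math
import random


def standardize(states):
    """Z-score each column using statistics of the rows passed in (the buffer)."""
    cols = list(zip(*states))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0  # guard constant columns
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in states]


def knn_consistency(states, oracle_actions, k=10):
    """Mean fraction of each point's k nearest neighbors that share its oracle action."""
    n = len(states)
    total = 0.0
    for i in range(n):
        by_distance = sorted(
            (math.dist(states[i], states[j]), j) for j in range(n) if j != i
        )
        neighbors = [j for _, j in by_distance[:k]]
        total += sum(oracle_actions[j] == oracle_actions[i] for j in neighbors) / k
    return total / n


# Synthetic toy buffer (not the paper's data): two well-separated clusters of 12 points.
rng = random.Random(0)
states = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(12)]
states += [[rng.gauss(50, 1), rng.gauss(50, 1)] for _ in range(12)]
z = standardize(states)

separable = [0] * 12 + [1] * 12         # oracle action determined by the cluster
ambiguous = [i % 2 for i in range(24)]  # oracle action unrelated to the state

c_separable = knn_consistency(z, separable)
c_ambiguous = knn_consistency(z, ambiguous)
print(c_separable, c_ambiguous)
```

When the oracle action is a function of the state, consistency is high; when it is unrelated to the state, consistency falls toward the label base rate regardless of the metric, which is the pattern the robustness check is designed to detect.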

## Appendix F Distribution-Shift Tables

[Table 7](https://arxiv.org/html/2606.04161#A6.T7) reports the full buffer-to-test oracle-shift summary used for [Figure 5](https://arxiv.org/html/2606.04161#S6.F5), including the controlled classifier comparison to the buffer oracle.

Table 7: Full 16-window buffer-to-test oracle-shift summary. The “clf $\to$ buffer” column is the controlled classifier’s difference in distance to the buffer oracle versus the test oracle. Positive values mean the classifier remains closer to the buffer than to deployment.

## Appendix G External-Stratifier Robustness

The slices below recompute the main specialization analysis using stratifiers that do not depend on the base-model probabilities themselves. The compact summary in [Table 8](https://arxiv.org/html/2606.04161#A7.T8) distills the full exported pairwise analysis into the comparisons most relevant to the workshop submission.

Table 8: Compact summary of the external-stratifier robustness analysis. External stratifiers correlate moderately with each other, but their agreement with the original probability-derived risk-bin construction is much weaker on the main configuration.

At $(14\text{d},14\text{d})$, the strongest agreement is between the three external stratifiers, not between any external stratifier and the original risk-bin partition. We therefore treat the specialization story as fragile rather than as a main-paper result.

### G.1 Full 16-window correlation matrix

[Table 9](https://arxiv.org/html/2606.04161#A7.T9) reports the per-window correlation matrix underlying the aggregate robustness summary.

Table 9: Full 16-window correlation matrix for the external-stratifier robustness analysis. ‘cb’ denotes course baserate, ‘rb’ denotes the probability-derived risk bin, and ‘k3/k5’ denote the two clustering-based stratifiers. The main $(14\text{d},14\text{d})$ setting is unusual in that all three correlations involving the original risk-bin partition are low.

### G.2 Main-configuration slice winners

To make that fragility more concrete, [Table 10](https://arxiv.org/html/2606.04161#A7.T10) reports which base model is favored by the oracle within each external slice of the $(14\text{d},14\text{d})$ configuration, and which single base model actually attains the best held-out accuracy there.

Table 10: Main-configuration slice winners for the external-stratifier analysis. Several slices change which model looks best once the strata are defined independently of the original probability-derived risk bins, which is why the specialization claim should be treated cautiously.

## Appendix H Training and Uncertainty Details

All main-paper numbers are means and standard deviations over 5 random seeds. Each seed resamples $1{,}000$ student-course pairs, regenerates the stratified $800/200$ train-test split, and rebuilds the 4-fold CV buffer.
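The per-seed protocol can be sketched as follows. This is a minimal illustration under stated assumptions: stratification is taken to be on the binary dropout label, the pool size and labels are synthetic, and `stratified_folds` and `make_seed_split` are hypothetical helpers rather than the authors' code. How the base models consume the four folds to populate the buffer is not shown.

```python
import random


def stratified_folds(indices, labels, n_folds, rng):
    """Deal indices into n_folds folds with near-equal class proportions."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    shuffled.sort(key=lambda i: labels[i])  # stable sort keeps the shuffle within each class
    folds = [[] for _ in range(n_folds)]
    for pos, idx in enumerate(shuffled):
        folds[pos % n_folds].append(idx)
    return folds


def make_seed_split(labels, seed, n_sample=1000, n_cv_folds=4):
    """Resample n_sample examples, hold out one fifth for test, split the rest into CV folds."""
    rng = random.Random(seed)
    sample = rng.sample(range(len(labels)), n_sample)
    fifths = stratified_folds(sample, labels, 5, rng)
    test = fifths[0]                                    # 200 held-out examples
    train = [i for fold in fifths[1:] for i in fold]    # 800 buffer examples
    cv_folds = stratified_folds(train, labels, n_cv_folds, rng)
    return train, test, cv_folds


# Synthetic pool of binary labels (hypothetical size and base rate).
pool_rng = random.Random(123)
labels = [1 if pool_rng.random() < 0.7 else 0 for _ in range(5000)]

splits = [make_seed_split(labels, seed) for seed in range(5)]
train, test, cv_folds = splits[0]
print(len(train), len(test), [len(f) for f in cv_folds])
```

Sorting by label after shuffling and then dealing round-robin is one simple way to keep class proportions nearly equal across folds without an external library.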

The behavioral-cloning model is a two-layer MLP ($d\rightarrow 64\rightarrow 5$) trained with hard-label cross-entropy for 30 epochs, batch size 64, dropout 0.2, and weight decay $10^{-4}$. The main-paper DQN is a one-step contextual-bandit reduction with $\gamma=0$, architecture ($d\rightarrow 128\rightarrow 64\rightarrow 5$), soft target updates with $\tau=0.05$, and 50 training epochs using the canonical $\{0,1\}$ oracle-match reward defined in [Section 4](https://arxiv.org/html/2606.04161#S4). The log-probability shaping variant used in earlier drafts is reported as a reward-sensitivity row in Appendix [C](https://arxiv.org/html/2606.04161#A3). The probabilities-only and disagreement-augmented variants use the same BC objective with the corresponding input state.
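To make the contextual-bandit reduction concrete, the sketch below assumes the standard DQN target $r+\gamma\max_{a'}Q^{-}(s',a')$ and the usual Polyak form of a soft target update, in which $\tau$ weights the online parameters. Both conventions are assumptions, and `td_target` and `soft_update` are illustrative names, not the paper's implementation.

```python
def td_target(reward, next_q_values, gamma=0.0):
    """Standard one-step target; with gamma = 0 it reduces to the immediate reward."""
    return reward + gamma * max(next_q_values)


def soft_update(target_params, online_params, tau=0.05):
    """Polyak averaging: move each target parameter a fraction tau toward the online one."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# With gamma = 0 the bootstrapped term vanishes, whatever the target network predicts.
bandit_target = td_target(reward=1.0, next_q_values=[5.0, 9.0], gamma=0.0)

# One soft update from all-zero target parameters toward hypothetical online parameters.
new_target = soft_update([0.0, 0.0], [1.0, -2.0], tau=0.05)
print(bandit_target, new_target)
```

Under these assumptions each Q-value is regressed directly onto the observed $\{0,1\}$ reward for the logged action, which is why the setup behaves like a one-step contextual bandit.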

Paired uncertainty intervals in the body are bootstrap intervals computed over the 5-seed per-configuration summaries. We use those intervals only for the comparisons that materially affect the paper’s claims: BC-full versus best static accuracy, probabilities-only BC versus BC-full accuracy, and disagreement-augmented BC versus BC-full accuracy.
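A paired bootstrap over seeds might look like the sketch below. The percentile method, the number of resamples, and the per-seed accuracies are all assumptions made for illustration; the paper does not specify these details here, and `paired_bootstrap_ci` is a hypothetical helper.

```python
import random


def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired differences a[i] - b[i]."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    boot_means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = boot_means[int(alpha / 2 * n_boot)]
    hi = boot_means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(diffs) / n, lo, hi


# Hypothetical per-seed test accuracies (made up for this example, not the paper's numbers).
bc_full = [0.755, 0.740, 0.770, 0.745, 0.750]
best_static = [0.760, 0.765, 0.755, 0.770, 0.760]

mean_diff, ci_lo, ci_hi = paired_bootstrap_ci(bc_full, best_static)
print(mean_diff, ci_lo, ci_hi)
```

Pairing by seed removes the variance shared by both methods on the same resample. With only five seeds there are just 126 distinct resamples, so intervals of this kind are coarse and best read as a rough guide.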