Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage

arXiv cs.LG Papers

Summary

This paper introduces SLC (State-space Logit Correction), which corrects per-item logit bias in knowledge tracing models using empirical-Bayes shrinkage via a Kalman smoother, improving AUC beyond global calibration techniques.

arXiv:2606.14123v1 Announce Type: new Abstract: Deployed knowledge-tracing models are typically frozen after training, yet systematic per-item logit bias arises, from limited per-item expressivity in backbone architectures and from post-deployment shifts in item properties, degrading prediction quality. Global post-hoc calibrators such as Platt scaling, temperature scaling, and isotonic regression improve probability estimates but leave discriminative ability, as measured by AUC, unchanged. This AUC invariance is a structural consequence of monotone score-only transforms; recovering the stranded discrimination requires conditioning on item identity. We propose SLC (State-space Logit Correction), which converts binary observations to Gaussian pseudo-observations via Laplace/IRLS, applies empirical-Bayes shrinkage through a Kalman smoother, and fits an offset-Platt link. The state-space formulation also yields a detectability bound that characterizes the Bernoulli information floor, explaining why temporal tracking provides no benefit at current data densities. Across four datasets, five backbones, and three seeds, SLC improves AUC on all four datasets and NLL on three, with the advantage concentrating on sparse items. Cross-domain controls suggest that the same phenomenon can arise beyond education when the deployed backbone leaves entity-level bias.

Cached at: 06/15/26, 09:10 AM

# Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage
Source: [https://arxiv.org/html/2606.14123](https://arxiv.org/html/2606.14123)
Xiaoran Yan, Cheng Tang, Atsushi Shimada

Kyushu University, Fukuoka, Japan
xiaoran.y@outlook.com, tang@limu.ait.kyushu-u.ac.jp, atsushi@ait.kyushu-u.ac.jp

###### Abstract

Deployed knowledge-tracing models are typically frozen after training, yet systematic per-item logit bias arises from limited per-item expressivity in backbone architectures and from post-deployment shifts in item properties, degrading prediction quality. Global post-hoc calibrators such as Platt scaling, temperature scaling, and isotonic regression improve probability estimates but leave discriminative ability, as measured by AUC, unchanged. This AUC invariance is a structural consequence of monotone score-only transforms; recovering the stranded discrimination requires conditioning on item identity. We propose SLC (State-space Logit Correction), which converts binary observations to Gaussian pseudo-observations via Laplace/IRLS, applies empirical-Bayes shrinkage through a Kalman smoother, and fits an offset-Platt link. The state-space formulation also yields a detectability bound that characterizes the Bernoulli information floor, explaining why temporal tracking provides no benefit at current data densities. Across four datasets, five backbones, and three seeds, SLC improves AUC on all four datasets and NLL on three, with the advantage concentrating on sparse items. Cross-domain controls suggest that the same phenomenon can arise beyond education when the deployed backbone leaves entity-level bias.

## 1 Introduction

Knowledge-tracing (KT) models \[[3](https://arxiv.org/html/2606.14123#bib.bib3)\] estimate the probability that a student will answer an item correctly; these probabilities drive adaptive item selection, mastery gating, and early-warning systems. Per-item logit bias arises from two sources: backbone architectures with limited per-item expressivity produce structural prediction errors, and post-deployment shifts in item properties \[[19](https://arxiv.org/html/2606.14123#bib.bib19),[20](https://arxiv.org/html/2606.14123#bib.bib20)\] (difficulty changes, new items, population evolution) further compound them.

The standard response is post-hoc calibration: Platt scaling \[[25](https://arxiv.org/html/2606.14123#bib.bib25)\], temperature scaling \[[13](https://arxiv.org/html/2606.14123#bib.bib13)\], or isotonic regression \[[35](https://arxiv.org/html/2606.14123#bib.bib35)\]. These methods improve probability estimates, yet the model's *discriminative* ability (its AUC) remains unchanged because global score-only transforms adjust the *scale* of predictions but not their *ordering*. We draw a sharp distinction: *calibration* adjusts probability scale (AUC-invariant); *correction* recovers ranking quality (AUC-improving). This paper addresses correction.

Figure [1](https://arxiv.org/html/2606.14123#S1.F1) illustrates this on ASSISTments 2017 (temporal split, 5 backbones): raw and Platt-scaled AUC are identical at every time slice, yet per-item residual correction (ResCal) recovers $+2.6\,\text{pp}$ of hidden headroom (Eedi: $+3.5\,\text{pp}$); SLC further improves on this via shrinkage.

![Refer to caption](https://arxiv.org/html/2606.14123v1/x1.png)

Figure 1: ASSISTments 2017, temporal split, 5 backbones averaged. Left: AUC over time for the raw backbone (Base) and Platt scaling (Platt); the curves are identical, confirming AUC invariance of global calibration. Right: per-item residual correction (ResCal) recovers $+2.6\,\text{pp}$ of hidden AUC headroom that global calibration structurally cannot access.

This behavior is structural: Lemma [1](https://arxiv.org/html/2606.14123#Thmlemma1) shows that any strictly monotone, score-only transformation preserves rankings and hence AUC; recovering headroom therefore requires conditioning on item identity. We propose SLC (State-space Logit Correction; code: [https://github.com/xiaoran-y/SLC](https://github.com/xiaoran-y/SLC)), which models per-item bias as a Gaussian random effect \[[6](https://arxiv.org/html/2606.14123#bib.bib6)\], converts binary observations to Gaussian pseudo-observations via Laplace/IRLS, and pools them with a Kalman smoother \[[4](https://arxiv.org/html/2606.14123#bib.bib4),[7](https://arxiv.org/html/2606.14123#bib.bib7)\]. The state-space formulation also yields a detectability bound (Proposition [2](https://arxiv.org/html/2606.14123#Thmproposition2)) that quantifies when temporal tracking becomes viable. The corrected prediction takes the form

$$p = \sigma\big(a\,\eta_0 + b_0 + \hat{b}_i\big), \tag{1}$$

where $\eta_0 = \operatorname{logit}(p_0)$ is the frozen backbone logit and $(a, b_0)$ are global affine parameters. The framework extends to temporal tracking $\beta(i,t) = b_i + u_i(t)$, but Proposition [2](https://arxiv.org/html/2606.14123#Thmproposition2) shows that the minimum detectable drift far exceeds observed temporal variation at current KT data densities.

Our contributions are:

1. Stranded discrimination. We identify per-item "stranded" AUC headroom in deployed KT models. AUC invariance of monotone score-only transforms (Lemma [1](https://arxiv.org/html/2606.14123#Thmlemma1)) serves as a diagnostic; a five-level baseline ladder confirms that neither score-only nor time-only conditioning recovers this headroom across all 20 configurations.
2. Per-item shrinkage pipeline. We propose SLC: Laplace/IRLS pseudo-observations, empirical-Bayes shrinkage via Kalman smoothing, and an offset-Platt link. The additive per-item form is theoretically motivated (Proposition [1](https://arxiv.org/html/2606.14123#Thmproposition1)); a detectability bound (Proposition [2](https://arxiv.org/html/2606.14123#Thmproposition2)) explains why temporal tracking is information-limited at current densities and predicts the viability threshold (on the order of $10^5$ obs/item).
3. Comprehensive evaluation. Four KT datasets, five backbones, three seeds; density-stratified analysis, calibration-fraction sweep, synthetic regime map, and cross-domain controls including a non-KT flight-delay experiment.

## 2 Related Work

### 2.1 Knowledge-Tracing Models

DKT \[[24](https://arxiv.org/html/2606.14123#bib.bib24)\] applies recurrent networks; SAKT \[[23](https://arxiv.org/html/2606.14123#bib.bib23)\] and AKT \[[11](https://arxiv.org/html/2606.14123#bib.bib11)\] use self-attention; DKVMN \[[36](https://arxiv.org/html/2606.14123#bib.bib36)\] augments memory networks; LPKT \[[26](https://arxiv.org/html/2606.14123#bib.bib26)\] models the learning process explicitly. These architectures differ in per-item expressivity: DKT, SAKT, and DKVMN operate at the skill level and share representations across items within a skill, while AKT and LPKT include per-item parameters. Even with per-item modeling, frozen backbones accumulate residual per-item bias after deployment. SLC is post-hoc and backbone-agnostic: it corrects this residual bias from any frozen model's logits without retraining.

### 2.2 Post-Hoc Calibration

Post-hoc calibration adjusts predicted probabilities to match observed frequencies. Platt scaling \[[25](https://arxiv.org/html/2606.14123#bib.bib25)\], temperature scaling \[[13](https://arxiv.org/html/2606.14123#bib.bib13)\], isotonic regression \[[35](https://arxiv.org/html/2606.14123#bib.bib35)\], and histogram binning \[[34](https://arxiv.org/html/2606.14123#bib.bib34)\] are all score-only transforms; strictly monotone variants leave AUC invariant (Lemma [1](https://arxiv.org/html/2606.14123#Thmlemma1)). ECE is not a proper scoring rule \[[12](https://arxiv.org/html/2606.14123#bib.bib12)\]; we treat NLL as a co-primary metric and ECE as a diagnostic.

### 2.3 Per-Group and Per-Instance Calibration

Several works condition calibration on input features, including class-wise \[[10](https://arxiv.org/html/2606.14123#bib.bib10)\] and parameterized \[[28](https://arxiv.org/html/2606.14123#bib.bib28)\] temperature scaling, density-aware calibration \[[29](https://arxiv.org/html/2606.14123#bib.bib29)\], and field-aware calibrators \[[22](https://arxiv.org/html/2606.14123#bib.bib22)\]. SLC operates in a metadata-only regime (item id + time index, no learned embeddings). Class-wise scaling with $K \gg 1$ categories reduces to unregularized per-item estimation (our Naive baseline); SLC adds shrinkage. In the static limit, SLC reduces to ridge logistic regression with per-item intercepts \[[2](https://arxiv.org/html/2606.14123#bib.bib2)\]. Logit adjustment \[[21](https://arxiv.org/html/2606.14123#bib.bib21)\] shares the per-class offset idea but targets class imbalance.

### 2.4 Temporal Adaptation and State-Space Models

Test-time adaptation (e.g., Tent \[[31](https://arxiv.org/html/2606.14123#bib.bib31)\]) modifies model parameters online, while dynamic IRT models \[[32](https://arxiv.org/html/2606.14123#bib.bib32),[17](https://arxiv.org/html/2606.14123#bib.bib17)\] jointly re-estimate ability and difficulty; both require either model access or full re-estimation. SLC instead treats the backbone as frozen and adopts the Laplace/IRLS + Kalman inference techniques developed for state-space GLMMs \[[4](https://arxiv.org/html/2606.14123#bib.bib4),[7](https://arxiv.org/html/2606.14123#bib.bib7),[2](https://arxiv.org/html/2606.14123#bib.bib2)\] as a post-hoc per-item correction algorithm, with the resulting shrinkage paralleling James–Stein estimation \[[6](https://arxiv.org/html/2606.14123#bib.bib6),[5](https://arxiv.org/html/2606.14123#bib.bib5)\].

## 3 Method

### 3.1 Problem Setting

A frozen KT backbone produces logits $\eta_0(x) = \operatorname{logit}(p_0(x))$ for each interaction $x = (s, i, t)$ (student $s$, item $i$, time index $t$). Data is partitioned temporally: train $\to$ calibration $\to$ test (strictly later), exposing genuine drift. The post-hoc correction uses $\eta_0$, labels $y \in \{0, 1\}$, item IDs, and time indices from the calibration window only: no test labels, no backbone parameter updates.
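The temporal partition described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `temporal_split`, the `(timestamp, record)` tuple layout, and the 60/20/20 fractions are all assumptions made for the example.

```python
def temporal_split(interactions, train_frac=0.6, calib_frac=0.2):
    """Split (timestamp, record) pairs into train / calibration / test by time.

    The fractions are illustrative assumptions, not values from the paper.
    If several interactions share a timestamp at a boundary, a stricter
    implementation would move the boundary so that no timestamp straddles it.
    """
    ordered = sorted(interactions, key=lambda r: r[0])  # oldest first
    n = len(ordered)
    n_train = int(n * train_frac)
    n_calib = int(n * calib_frac)
    train = ordered[:n_train]
    calib = ordered[n_train:n_train + n_calib]
    test = ordered[n_train + n_calib:]                  # latest interactions
    return train, calib, test
```

Only the calibration slice would feed the correction; the test slice is held out for evaluation.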

### 3.2 AUC Invariance of Score-Only Calibration

The central structural observation is that global calibration is inherently unable to improve AUC. This invariance is a classical fact \[[14](https://arxiv.org/html/2606.14123#bib.bib14),[8](https://arxiv.org/html/2606.14123#bib.bib8)\]; we restate it because it serves as the diagnostic for stranded headroom:

###### Lemma 1 (AUC invariance)

Let $s(x) \in \mathbb{R}$ be a scalar score and $\phi: \mathbb{R} \to \mathbb{R}$ be strictly increasing. Then $\operatorname{AUC}(\phi(s)) = \operatorname{AUC}(s)$.

###### Proof

AUC equals the probability that a randomly drawn positive receives a higher score than a randomly drawn negative \[[14](https://arxiv.org/html/2606.14123#bib.bib14)\]. Since $\phi$ is strictly increasing, $s(x_+) > s(x_-)$ iff $\phi(s(x_+)) > \phi(s(x_-))$. All pairwise orderings are preserved, and AUC, which depends only on these orderings, remains unchanged (see also \[[8](https://arxiv.org/html/2606.14123#bib.bib8)\]).

Platt scaling ($a > 0$) and temperature scaling ($T > 0$) satisfy strict monotonicity and are exactly AUC-invariant. Isotonic regression and histogram binning are non-decreasing but piecewise constant, so they fall outside Lemma [1](https://arxiv.org/html/2606.14123#Thmlemma1) and can in principle change AUC through ties; in practice the effect is negligible with continuous logits. Platt-T (score + time, no item id) changes AUC only marginally in our experiments: without item identity, per-item heterogeneity cannot be resolved. The implication: *recovering AUC headroom requires conditioning on at least the item identity.*
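A small simulation can make the lemma concrete. The sketch below is illustrative only: the two-item toy data, the Platt parameters `0.7` and `-0.3`, and the use of oracle (true) item offsets are assumptions chosen for the demonstration, not part of the paper's experiments.

```python
import math
import random

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

rng = random.Random(0)
item_bias = {0: -1.5, 1: 1.5}   # hypothetical per-item bias the backbone does not see
items, logits, labels = [], [], []
for _ in range(2000):
    i = rng.randrange(2)
    eta0 = rng.gauss(0.0, 1.0)  # frozen backbone logit
    y = 1 if rng.random() < sigmoid(eta0 + item_bias[i]) else 0
    items.append(i)
    logits.append(eta0)
    labels.append(y)

auc_raw = auc(logits, labels)
# Global Platt-style map: strictly increasing in the score, so rankings are unchanged.
auc_platt = auc([sigmoid(0.7 * e - 0.3) for e in logits], labels)
# Item-conditioned offset (oracle values here): rankings change across items.
auc_item = auc([e + item_bias[i] for e, i in zip(logits, items)], labels)
print(auc_raw, auc_platt, auc_item)
```

In this toy setting the first two numbers coincide, while the item-conditioned score ranks better; how large the gap is in practice depends on the between-item bias variance.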

### 3.3 Per-Item Correction Form

The following proposition is a standard conditional-mean projection argument; we state it to fix the form of the correction.

###### Proposition 1

Let $\eta^*(s,i,t)$ denote the Bayes-optimal logit for student $s$, item $i$, time $t$, and let $\eta_0(s,i,t)$ be the frozen backbone logit. Define the item-specific bias $\beta(i) = \mathbb{E}[\eta^* - \eta_0 \mid i]$ and the residual

$$\epsilon(s,i,t) = \eta^*(s,i,t) - \eta_0(s,i,t) - \beta(i), \tag{2}$$

so that $\mathbb{E}[\epsilon \mid i] = 0$ by construction. Then the MSE-optimal additive correction of $\eta_0$ that depends only on item identity is

$$\hat{\eta}(s,i,t) = \eta_0(s,i,t) + \beta(i).$$

###### Proof

Among all corrections of the form $\eta_0 + f(i)$, the MSE $\mathbb{E}[(\eta^* - \eta_0 - f(i))^2]$ is minimized by $f(i) = \mathbb{E}[\eta^* - \eta_0 \mid i] = \beta(i)$.

Because $\beta(i)$ is defined as the conditional mean, $\mathbb{E}[\epsilon \mid i] = 0$ holds by construction. How much AUC headroom per-item correction can recover depends on $\mathrm{Var}(\beta(i))$, the between-item component of the backbone error, which we evaluate empirically in Section [4](https://arxiv.org/html/2606.14123#S4).
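For readers who want the projection step spelled out, the minimization in the proof can be expanded as below. This is the standard conditional-mean argument written in the notation of Proposition 1; it adds no assumption beyond finite second moments.

```latex
\begin{aligned}
\mathbb{E}\big[(\eta^{*}-\eta_{0}-f(i))^{2}\big]
  &= \mathbb{E}\big[(\epsilon + \beta(i) - f(i))^{2}\big] \\
  &= \mathbb{E}[\epsilon^{2}]
     + 2\,\mathbb{E}\big[(\beta(i)-f(i))\,\mathbb{E}[\epsilon \mid i]\big]
     + \mathbb{E}\big[(\beta(i)-f(i))^{2}\big] \\
  &= \mathbb{E}[\epsilon^{2}] + \mathbb{E}\big[(\beta(i)-f(i))^{2}\big].
\end{aligned}
```

The cross term vanishes because $\mathbb{E}[\epsilon \mid i] = 0$, so the MSE is minimized exactly when $f(i) = \beta(i)$, and the irreducible part $\mathbb{E}[\epsilon^2]$ is what an item-only correction cannot remove.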

A standard 1PL IRT model of item difficulty drift provides an idealized setting to verify the correction form. Extending the conditioning to both item and time ($\beta(i,t) = \mathbb{E}[\eta^* - \eta_0 \mid i, t]$), the residual vanishes entirely:

###### Corollary 1 (1PL exactness)

Let $Y \sim \operatorname{Ber}(\sigma(\theta_s - b_i(t)))$ with $b_i(t) = b_i^{\mathrm{train}} + \Delta_i(t)$, and let the frozen backbone produce $\eta_0(s,i) = \alpha(\theta_s - b_i^{\mathrm{train}}) + c_0$ for some backbone scale factor $\alpha > 0$ and shift $c_0$ (distinct from the residual term in Proposition [1](https://arxiv.org/html/2606.14123#Thmproposition1)). Then the Bayes-optimal logit $\eta^* = \theta_s - b_i(t)$ is *exactly* recovered by the offset-Platt + per-item correction:

$$\eta^* = a\,\eta_0 + b + \beta(i,t), \qquad a = 1/\alpha, \quad b = -c_0/\alpha, \quad \beta(i,t) = -\Delta_i(t).$$

The residual MSE is zero.

###### Proof

$$\eta^* = \theta_s - b_i^{\mathrm{train}} - \Delta_i(t) = \frac{\eta_0 - c_0}{\alpha} - \Delta_i(t) = \frac{1}{\alpha}\,\eta_0 - \frac{c_0}{\alpha} - \Delta_i(t) = a\,\eta_0 + b + \beta(i,t).$$

Under 1PL drift, the exact correction $\beta(i,t)$ is student-free and additive (not scaled by $a$); the static $\hat{b}_i$ averages over time. Synthetic experiments (Section [4.5](https://arxiv.org/html/2606.14123#S4.SS5)) verify robustness beyond the 1PL assumption.
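The identity in Corollary 1 is easy to check numerically. In the sketch below, the backbone scale `alpha = 1.7`, shift `c0 = 0.4`, and the Gaussian draws for ability, difficulty, and drift are arbitrary values assumed for illustration.

```python
import random

rng = random.Random(1)
alpha, c0 = 1.7, 0.4             # hypothetical backbone scale and shift
a, b = 1.0 / alpha, -c0 / alpha  # offset-Platt parameters from Corollary 1

max_err = 0.0
for _ in range(1000):
    theta = rng.gauss(0.0, 1.0)    # student ability
    b_train = rng.gauss(0.0, 1.0)  # item difficulty at training time
    delta = rng.gauss(0.0, 0.5)    # post-deployment difficulty drift
    eta0 = alpha * (theta - b_train) + c0  # frozen backbone logit
    eta_star = theta - (b_train + delta)   # Bayes-optimal 1PL logit
    beta = -delta                          # per-item, per-time correction
    recovered = a * eta0 + b + beta
    max_err = max(max_err, abs(recovered - eta_star))

print(max_err)  # differs from zero only by floating-point rounding
```

Note that `beta` is added after the affine map rather than inside it, which is the additive form the model specification relies on.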

### 3.4 Model Specification

We formalize the per-item bias as a Gaussian random effect within a generalized linear mixed model (GLMM):

$$y \sim \operatorname{Ber}\big(\sigma(a\,\eta_0 + b_0 + b_i)\big), \qquad b_i \sim \mathcal{N}(0, \sigma_b^2). \tag{3}$$

Here $(a, b_0)$ absorb global scale/shift distortion, $b_i$ captures per-item logit shift, and $\mathcal{N}(0, \sigma_b^2)$ provides shrinkage. Critically, $b_i$ enters as an additive offset, not scaled by $a$, matching Corollary [1](https://arxiv.org/html/2606.14123#Thmcorollary1); the alternative form $\sigma(a(\eta_0 + \hat{b}_i) + b_0)$ performs worse empirically (Section [4.3](https://arxiv.org/html/2606.14123#S4.SS3)). SLC fits this parameterization in two stages: $\{b_i\}$ are first estimated under the working model $p = \sigma(\eta_0 + b_i)$ with implicit $(a, b_0) = (1, 0)$; $(a, b_0)$ are then fit treating the $\hat{b}_i$ as fixed offsets (Section [3.6](https://arxiv.org/html/2606.14123#S3.SS6)). Since $b_i$ is additive and independent of $a$ (Corollary [1](https://arxiv.org/html/2606.14123#Thmcorollary1)), the two stages do not interfere; this separation also preserves the diagonal Hessian that enables $O(N + K)$ per-item estimation (appendix).

The GLMM also admits a state-space extension: $\beta(i,t) = b_i + u_i(t)$, $u_i(t) = u_i(t-1) + \varepsilon_t$, $\varepsilon_t \sim \mathcal{N}(0, \sigma_u^2)$. However, temporal tracking does not improve AUC or NLL on our datasets (Section [4.3](https://arxiv.org/html/2606.14123#S4.SS3), Proposition [2](https://arxiv.org/html/2606.14123#Thmproposition2)); the default SLC uses static $b_i$.

### 3.5 Estimation via Kalman Smoother

The binary likelihood does not admit conjugate Kalman updates, so we linearize it with the standard Laplace approximation \[[4](https://arxiv.org/html/2606.14123#bib.bib4)\]. Time bins are equal-count (quantile) partitions of cumulative interaction indices from the train and calibration windows; for the static model the final estimate pools across all bins and is invariant to the bin definition. For each (item, time-bin) cell $(i,t)$, given a current estimate $\hat{b}_i^{(\text{prev})}$, we compute predicted probabilities $p_n = \sigma(\eta_0(x_n) + \hat{b}_i^{(\text{prev})})$ for each observation $n$ in the cell and then form

$$W_{i,t} = \sum_{n \in (i,t)} p_n (1 - p_n), \tag{4}$$

$$z_{i,t} = \hat{b}_i^{(\text{prev})} + \frac{\sum_{n \in (i,t)} (y_n - p_n)}{W_{i,t}}. \tag{5}$$

These yield approximate Gaussian observations $z_{i,t} \sim \mathcal{N}(b_i, 1/W_{i,t})$, where $W_{i,t}$ is the Fisher information weight (effective sample size) and $z_{i,t}$ the pseudo-residual, bridging the binary likelihood to a Gaussian state-space model. In the static case, the Kalman smoother reduces to a weighted-average shrinkage estimator:

$$\hat{b}_i = \frac{\sum_t W_{i,t}\, z_{i,t}}{1/\sigma_b^2 + \sum_t W_{i,t}}. \tag{6}$$

The shrinkage fraction $\lambda_i = \sum_t W_{i,t} / (1/\sigma_b^2 + \sum_t W_{i,t})$ ranges from near 0 for sparse items (estimate stays close to the prior mean) to near 1 for dense items (estimate tracks the empirical residual). This is an empirical-Bayes shrinkage estimator \[[5](https://arxiv.org/html/2606.14123#bib.bib5)\] that parallels James–Stein shrinkage \[[6](https://arxiv.org/html/2606.14123#bib.bib6)\]: under the Laplace–normal proxy the estimator dominates the naive per-item mean in MSE whenever $K \geq 3$; in the original Bernoulli model the dominance is approximate, but the empirical gains on sparse items are substantial (Section [4.3](https://arxiv.org/html/2606.14123#S4.SS3)). The prior variance $\sigma_b^2 = 1.0$ is fixed throughout; the estimator is insensitive to this choice because $\lambda_i$ saturates rapidly for $W_i \gg 1/\sigma_b^2$ (sweeping $\sigma_b^2 \in [0.1, 10]$ on AS17 changes AUC by $< 0.4\,\text{pp}$). In the static case, Eq. ([6](https://arxiv.org/html/2606.14123#S3.E6)) is the MAP estimate of $\ell_2$-penalized logistic regression with per-item intercepts and the backbone logit as offset; the Hessian is diagonal because each observation involves exactly one item, so the Newton step decomposes into $K$ independent scalar updates (appendix). Our deployed SLC uses a single step from $b_i^{(0)} = 0$ (Algorithm [1](https://arxiv.org/html/2606.14123#alg1)); the offset-Platt fit (Section [3.6](https://arxiv.org/html/2606.14123#S3.SS6)) absorbs residual global bias.
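Equations (4)–(6) with a single step from $b_i^{(0)} = 0$ can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the released implementation: the function name `slc_offsets` and its argument layout are hypothetical, and logits are assumed moderate enough that $p(1-p)$ does not underflow to zero.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def slc_offsets(logits, labels, items, bins, sigma_b2=1.0):
    """Single-step per-item offsets following Eqs. (4)-(6), starting from b_i = 0."""
    W = defaultdict(float)      # Fisher weight per (item, bin) cell, Eq. (4)
    score = defaultdict(float)  # sum of (y - p) per cell
    for eta0, y, i, t in zip(logits, labels, items, bins):
        p = sigmoid(eta0)
        W[(i, t)] += p * (1.0 - p)
        score[(i, t)] += y - p
    num = defaultdict(float)
    den = defaultdict(float)
    for (i, t), w in W.items():
        z = score[(i, t)] / w   # pseudo-observation, Eq. (5) with b_prev = 0
        num[i] += w * z
        den[i] += w
    # Shrinkage toward the prior mean 0, Eq. (6)
    return {i: num[i] / (1.0 / sigma_b2 + den[i]) for i in num}

# One item, four observations at logit 0, three correct, split over two bins.
offsets = slc_offsets([0.0] * 4, [1, 1, 1, 0], ["A"] * 4, [0, 0, 1, 1])
print(offsets["A"])
```

Items absent from the returned dictionary can be looked up with `offsets.get(item, 0.0)`, which matches the prior-mean default for unseen items. Regrouping the same four observations into different bins leaves the static estimate unchanged, since $W_{i,t} z_{i,t}$ simply sums the residuals.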

Items with zero calibration observations receive $\hat{b}_i = 0$ (the prior mean), defaulting to Platt; for sparse items, $\lambda_i \to 0$ pulls the estimate toward the prior without any numerical floor. For the temporal extension, Rauch–Tung–Striebel smoothing applies at cost $O(K \cdot T)$.
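To illustrate the temporal extension for a single item, the following sketch runs a scalar Kalman filter and Rauch–Tung–Striebel backward pass over the per-bin pseudo-observations. It is a simplified local-level variant assumed for illustration: one state per item with prior $\mathcal{N}(0, \sigma_b^2)$ on the first bin and random-walk variance `q`; the function name `rts_smooth` is hypothetical.

```python
def rts_smooth(z, W, sigma_b2=1.0, q=0.0):
    """Scalar local-level Kalman filter followed by an RTS smoother.

    z[t], W[t]: pseudo-observation and Fisher weight per bin (W[t] == 0 means no data).
    q: random-walk variance per bin; q = 0 corresponds to the static model.
    """
    T = len(z)
    m_pred, P_pred, m_filt, P_filt = [], [], [], []
    m, P = 0.0, sigma_b2                 # prior on the first bin
    for t in range(T):
        if t > 0:
            P = P + q                    # predict: random-walk step
        m_pred.append(m)
        P_pred.append(P)
        if W[t] > 0:
            gain = P / (P + 1.0 / W[t])  # Kalman gain for observation variance 1/W
            m = m + gain * (z[t] - m)
            P = (1.0 - gain) * P
        m_filt.append(m)
        P_filt.append(P)
    m_smooth = m_filt[:]
    for t in range(T - 2, -1, -1):       # backward RTS pass
        G = P_filt[t] / P_pred[t + 1]
        m_smooth[t] = m_filt[t] + G * (m_smooth[t + 1] - m_pred[t + 1])
    return m_smooth

static = rts_smooth([2.0, 0.0], [0.5, 0.5], q=0.0)   # both bins share one estimate
tracking = rts_smooth([2.0, 0.0], [0.5, 0.5], q=1.0) # bins are allowed to differ
```

With `q = 0` every smoothed value equals the weighted-average shrinkage estimate of Eq. (6), which is the reduction stated above; with `q > 0` the estimate is free to follow the per-bin pseudo-observations. Each item costs one forward and one backward pass over its $T$ bins.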

### 3.6 Link Estimation

Given $\{\hat{b}_i\}$, we fit $(a, b_0)$ treating $\hat{b}_i$ as a fixed offset:

$$(a^*, b_0^*) = \arg\min_{a, b_0} \sum_n -y_n \log p_n - (1 - y_n) \log(1 - p_n), \qquad p_n = \sigma\big(a\,\eta_0(x_n) + b_0 + \hat{b}_{i(n)}\big).$$

This is a convex logistic regression with two parameters, solved by IRLS in a few iterations. The offset-Platt link correctly treats $\hat{b}_i$ as a random effect that should not be rescaled, matching the GLMM parameterization (Eq. ([3](https://arxiv.org/html/2606.14123#S3.E3))) and Corollary [1](https://arxiv.org/html/2606.14123#Thmcorollary1).

Alternatively, we can fit a monotone function $g$ via isotonic regression on the corrected logit $\tilde{\eta}_n = \eta_0(x_n) + \hat{b}_{i(n)}$. This preserves rankings (hence AUC) but offers more flexibility than affine mapping, at the cost of higher variance on sparse data. Offset-Platt is the recommended default.
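The two-parameter offset-Platt fit can be sketched with plain Newton iterations, which coincide with IRLS for logistic regression. This is a minimal sketch under assumptions: the name `fit_offset_platt` is hypothetical, there is no line search or regularization, and the data are assumed not to be perfectly separable.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_offset_platt(logits, labels, offsets, iters=25):
    """Newton/IRLS fit of (a, b0) in sigmoid(a*eta0 + b0 + offset), offset held fixed."""
    a, b0 = 1.0, 0.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for eta0, y, off in zip(logits, labels, offsets):
            p = sigmoid(a * eta0 + b0 + off)
            r, w = p - y, p * (1.0 - p)  # residual and IRLS weight
            g_a += r * eta0
            g_b += r
            h_aa += w * eta0 * eta0
            h_ab += w * eta0
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        if det <= 1e-12:                 # degenerate Hessian: stop rather than divide
            break
        # Solve the 2x2 Newton system H @ step = gradient in closed form.
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        a, b0 = a - da, b0 - db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b0
```

Because the per-item offset enters the linear predictor without a coefficient, it is never rescaled by `a`, which is the property the offset-Platt link is meant to preserve.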

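### 3.7 Temporal Drift Detectability Bound

Temporal tracking does not improve AUC or NLL on our datasets (Section [3.4](https://arxiv.org/html/2606.14123#S3.SS4)). We explain this via a statistical detectability bound: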

###### Proposition 2 (Detectability bound)

Consider item $i$ observed across $T$ time bins. Let $n_t = |\mathcal{C}_{i,t}|$ denote the number of observations in bin $t$, and let $\beta_t$ denote the true per-item logit bias in bin $t$. Define the drift increment $\delta_t = \beta_t - \beta_{t-1}$. Under the Laplace approximation (Section [3.5](https://arxiv.org/html/2606.14123#S3.SS5)), the pseudo-observation for bin $t$ satisfies

$$z_{i,t} \sim \mathcal{N}(\beta_t,\; 1/W_{i,t}), \qquad W_{i,t} = \sum_{n \in \mathcal{C}_{i,t}} p_n (1 - p_n).$$

(i) The Fisher weight is bounded: $W_{i,t} \leq n_t/4$, since $p(1-p) \leq 1/4$ for all $p \in [0,1]$.

(ii) Treating adjacent bins as conditionally independent under the same approximation, the variance of the difference $z_{i,t} - z_{i,t-1}$ (which estimates $\delta_t$) is

$$\operatorname{Var}(z_{i,t} - z_{i,t-1}) = \frac{1}{W_{i,t}} + \frac{1}{W_{i,t-1}} \;\geq\; \frac{4}{n_t} + \frac{4}{n_{t-1}} \;\geq\; \frac{8}{n_{\max}},$$

where $n_{\max} = \max(n_t, n_{t-1})$; for equal-count adjacent bins this is simply $8/n_t$.

(iii) A Wald test at significance level $\alpha$ detects $\delta_t \neq 0$ only if

$$|\delta_t| \;\geq\; \delta_{\min} \;=\; z_\alpha \sqrt{\frac{1}{W_{i,t}} + \frac{1}{W_{i,t-1}}} \;\geq\; z_\alpha \sqrt{\frac{8}{n_{\max}}}. \tag{7}$$

###### Proof

(i) $W_{i,t} = \sum_n p_n (1 - p_n) \leq \sum_n \tfrac{1}{4} = n_t/4$, with equality when all $p_n = 1/2$.

(ii) Under this approximation, we treat $z_{i,t}$ and $z_{i,t-1}$ as conditionally independent because they are computed from disjoint observation sets, so $\operatorname{Var}(z_{i,t} - z_{i,t-1}) = 1/W_{i,t} + 1/W_{i,t-1}$. Applying (i) gives $1/W \geq 4/n$ for each bin, which yields the bound.

(iii) The Wald statistic $Z = (z_{i,t} - z_{i,t-1}) \big/ \sqrt{1/W_{i,t} + 1/W_{i,t-1}}$ is $\mathcal{N}(0,1)$ under $H_0: \delta_t = 0$. Rejection at level $\alpha$ requires $|Z| \geq z_\alpha$, i.e. $|\delta_t| \geq z_\alpha \sqrt{1/W_{i,t} + 1/W_{i,t-1}}$.

On ASSISTments 2017 (strongest drift), the median item has 48 observations across 10 bins ($\sim 4.8$/bin, $W_{\max} \leq 1.20$). At $\alpha = 0.05$: $\delta_{\min} \approx 2.53$ logit, i.e., an item's correct rate would need to shift from 50% to 92% within a single bin for the change to be detectable. Table [5](https://arxiv.org/html/2606.14123#S4.T5) reports all four datasets. As an independent check, we compute the median observed adjacent-bin logit change $|\hat{\delta}_{\text{obs}}|$ from empirical per-item correct rates; this is an *upper bound* on true drift because it includes estimation noise. On every dataset, $|\hat{\delta}_{\text{obs}}| \ll \delta_{\min}$, confirming that temporal tracking operates deep in the noise-dominated regime. Reducing $T$ lowers $\delta_{\min}$ (as $\sqrt{T}$), but also concentrates the signal into fewer comparisons with higher per-bin variance; at the extreme $T = 2$ on AS17, $\delta_{\min}$ drops to $\approx 1.13$ logit (50% $\to$ 75% correct rate), still implausible. The Kalman smoother implicitly optimizes this resolution–variance trade-off; its uniform loss to the static estimator in all 20 configurations (Section [4.3](https://arxiv.org/html/2606.14123#S4.SS3)) confirms that no temporal granularity recovers useful signal at these densities. Rearranging Eq. ([7](https://arxiv.org/html/2606.14123#S3.E7)) suggests a rough viability threshold on the order of $10^5$ total obs/item for a moderate drift of $\delta = 0.05$ logit/bin.
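The arithmetic behind these figures can be reproduced directly from Eq. (7). The sketch below assumes that $z_\alpha$ denotes the two-sided critical value (about 1.96 at $\alpha = 0.05$), which is the reading that matches the reported numbers; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def delta_min(n_t, n_prev, alpha=0.05):
    """Lower bound on the detectable drift, using W <= n/4 in each bin (two-sided test)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * math.sqrt(4.0 / n_t + 4.0 / n_prev)

def obs_per_bin_needed(delta, alpha=0.05):
    """Per-bin count at which the bound equals delta, assuming equal-sized adjacent bins."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 8.0 * z * z / (delta * delta)

d10 = delta_min(4.8, 4.8)            # AS17 median item, 48 observations over 10 bins
d2 = delta_min(24.0, 24.0)           # same item with T = 2
n_bin = obs_per_bin_needed(0.05)     # drift of 0.05 logit per bin
print(round(d10, 2), round(d2, 2), round(10 * n_bin))
```

Under these assumptions the first two values round to 2.53 and 1.13 logit, and the per-bin requirement for $\delta = 0.05$ is roughly $1.2 \times 10^4$ observations, so ten bins need on the order of $10^5$ observations per item.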

### 3.8 Algorithm Summary

Algorithm [1](https://arxiv.org/html/2606.14123#alg1) summarizes the complete SLC pipeline. Lines 2–5 convert the binary observations in each (item, time-bin) cell into Gaussian pseudo-observations $(z_{i,t}, W_{i,t})$ via the Laplace approximation of Section [3.5](https://arxiv.org/html/2606.14123#S3.SS5); lines 6–7 apply the empirical-Bayes shrinkage of Eq. ([6](https://arxiv.org/html/2606.14123#S3.E6)) to obtain the per-item offsets $\hat{b}_i$; line 8 fits the offset-Platt link of Section [3.6](https://arxiv.org/html/2606.14123#S3.SS6); line 9 produces the corrected predictions. The only hyperparameter is the prior variance $\sigma_b^2$, fixed to $1.0$ throughout.

Algorithm 1: SLC (State-space Logit Correction)

Input: backbone logits $\{\eta_{0,n}\}_{n=1}^{N}$, calibration labels $\{y_n\}$, item IDs $\{i_n\}$, time indices $\{t_n\}$, prior variance $\sigma_b^2$

Output: corrected probabilities $\{\hat{p}_n\}$ for new observations

1. Partition calibration data into $(i,t)$ cells: $\mathcal{C}_{i,t} \leftarrow \{n : i_n = i,\; t_n = t\}$
2. for each cell $(i,t)$ with $|\mathcal{C}_{i,t}| > 0$ do
3. $p_n \leftarrow \sigma(\eta_{0,n})$ for $n \in \mathcal{C}_{i,t}$
4. $W_{i,t} \leftarrow \sum_{n \in \mathcal{C}_{i,t}} p_n (1 - p_n)$
5. $z_{i,t} \leftarrow \sum_{n \in \mathcal{C}_{i,t}} (y_n - p_n) \,/\, W_{i,t}$
6. for each item $i \in \{1, \ldots, K\}$ do
7. $\hat{b}_i \leftarrow \dfrac{\sum_t W_{i,t}\, z_{i,t}}{1/\sigma_b^2 + \sum_t W_{i,t}}$
8. Fit $(a^*, b_0^*) \leftarrow \arg\min_{a, b_0} \sum_n \mathcal{L}_{\text{BCE}}\big(y_n,\, \sigma(a\,\eta_{0,n} + b_0 + \hat{b}_{i_n})\big)$
9. Prediction: $\hat{p}_n = \sigma(a^*\,\eta_{0,n} + b_0^* + \hat{b}_{i_n})$

Complexity is $O(N + K \cdot T)$. On AS17 ($N \approx 190$k, $K \approx 3$k), wall-clock time is under 2 s on a single CPU core. All reported experiments use this single-pass blockwise fit.

## 4 Experiments

### 4.1 Setup

We evaluate on four KT benchmarks spanning a range of drift intensities and data densities (Table [1](https://arxiv.org/html/2606.14123#S4.T1)). ASSISTments 2017 (AS17) \[[1](https://arxiv.org/html/2606.14123#bib.bib1)\] exhibits strong temporal drift with moderate data density; Eedi \[[33](https://arxiv.org/html/2606.14123#bib.bib33)\] exhibits moderate drift with similar density; ASSISTments 2009 (AS09) \[[9](https://arxiv.org/html/2606.14123#bib.bib9)\] has weak drift and extreme sparsity (median 3 observations per item); and Algebra \[[27](https://arxiv.org/html/2606.14123#bib.bib27)\] has weak-to-moderate drift with the most extreme sparsity (median 1 observation per item). Throughout the paper, Algebra denotes a merged dataset constructed from the 2005–2006, 2006–2007, and 2008–2009 releases; we merge these releases to obtain a longer and less fragmented temporal horizon, making drift-trend analysis under strict temporal splits more stable. All datasets use strict temporal splits with no overlap between training, calibration, and test windows. This protocol differs from standard KT benchmarks that use random student-level splits, and is essential for exposing genuine temporal drift. The fraction of test tokens whose item was never observed during calibration (cold-start) is small on AS17 (1.0%) and Eedi (0.3%), but substantial on AS09 (12.3%) and Algebra (24.6%); for these tokens SLC defaults to the global Platt prediction ($\hat{b}_i = 0$).

Table 1: Dataset characteristics. Observations per item and per bin are medians computed over the calibration window.

We evaluate five KT backbones: AKT, DKT, SAKT, DKVMN, and LPKT. Together they cover the main KT architecture families \[[24](https://arxiv.org/html/2606.14123#bib.bib24), [23](https://arxiv.org/html/2606.14123#bib.bib23), [11](https://arxiv.org/html/2606.14123#bib.bib11), [36](https://arxiv.org/html/2606.14123#bib.bib36), [26](https://arxiv.org/html/2606.14123#bib.bib26)\]. Each uses three seeds (225, 226, 227); we report mean and standard deviation across seeds. All experiments use a strict temporal split, train → calibration → test, with all hyperparameters selected by train-only rolling backtest and no test labels used at any stage. Our co-primary metrics are AUC (discrimination) and NLL (proper scoring rule). ECE is reported as a diagnostic, not a primary criterion: it is not a proper scoring rule \[[12](https://arxiv.org/html/2606.14123#bib.bib12)\], and NLL improvements can coexist with worse ECE. Table [2](https://arxiv.org/html/2606.14123#S4.T2) organizes the baselines by conditioning structure.

Table 2: Baseline ladder. Each level adds richer conditioning to test which factor drives AUC recovery. The classic score-only calibrators are summarized in the appendix.

Platt fits $p = \sigma(a\,\eta_{0} + b)$; Platt-T adds a time covariate, $p = \sigma(a\,\eta_{0} + c\,t + b)$. ResCal estimates per-item logit offsets by matching empirical and predicted success rates (with a minimum-count threshold); ResCal+Iso adds isotonic regression on top. Naive is a per-(item, time-bin) running-average offset with no smoothing, isolating the value of Kalman shrinkage.

### 4.2 Main Results

Tables [3](https://arxiv.org/html/2606.14123#S4.T3) and [4](https://arxiv.org/html/2606.14123#S4.T4) present AUC and NLL results averaged over five backbones and three seeds.

Table 3: AUC ($\uparrow$) averaged over 5 backbones $\times$ 3 seeds ($\pm$ avg seed std). Best per column in bold; per-backbone breakdown in the appendix.

The first two rows of Table [3](https://arxiv.org/html/2606.14123#S4.T3) confirm Lemma [1](https://arxiv.org/html/2606.14123#Thmlemma1): Base and Platt produce identical AUC. Across the other score-only calibrators, temperature scaling matches Platt to four decimals of AUC on all four datasets, isotonic changes AUC negligibly, and histogram binning slightly lowers AUC through ties; the appendix reports the full table. Platt-T (score + time, not covered by Lemma [1](https://arxiv.org/html/2606.14123#Thmlemma1)) changes AUC only marginally (0.00 to +0.23 pp across datasets), indicating that time-only recalibration without item identity does not recover the stranded headroom. The per-item baselines (ResCal, ResCal+Iso) unlock substantial headroom on AS17 (+2.5 pp) and Eedi (+3.5 pp) but fail on extremely sparse AS09 (+0.02 pp).
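The contrast between score-only and item-aware corrections can be checked on a toy example. The sketch below uses made-up logits, labels, and a hypothetical item offset (none of these numbers come from the paper): a global Platt map with positive slope is monotone in the score and leaves the pairwise ranking, and hence AUC, unchanged, whereas an item-dependent shift can reorder pairs across items.

```python
import math

def auc(scores, labels):
    """Mann-Whitney AUC; ties between a positive and a negative count one half."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (illustrative only): item 1 is systematically over-predicted.
eta0 = [1.5, 1.0, 0.2, -0.4, 1.8, 1.7, 1.6, 0.5]
item = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1, 1, 0, 0, 1, 0, 0, 0]

# Global Platt map with a > 0: strictly increasing in eta0.
platt = [sigmoid(0.7 * e - 0.3) for e in eta0]

# Item-aware correction with a hypothetical offset for item 1.
b_hat = {0: 0.0, 1: -1.5}
item_aware = [sigmoid(e + b_hat[i]) for e, i in zip(eta0, item)]

print(auc(eta0, y), auc(platt, y), auc(item_aware, y))
```

The first two printed values coincide because the Platt map preserves every pairwise ordering; the third differs because the offset moves item 1's negatives below item 0's positives.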

SLC improves AUC over Platt on every dataset, with gains correlating with drift intensity: +3.68 pp (AS17), +3.90 pp (Eedi), +1.90 pp (Algebra). These gains are broad across backbones: SLC improves AUC on all five backbones for AS17, AS09, and Algebra, and on four of five for Eedi; NLL follows a similar pattern, with isolated exceptions on sparse AS09 and one Eedi backbone (per-backbone tables in the appendix). The most informative comparison is AS09: ResCal extracts negligible headroom (+0.02 pp) while SLC recovers +2.84 pp, demonstrating that Kalman shrinkage is essential in sparse regimes. Conversely, Naive (unsmoothed per-item means) *degrades* AUC by −1.29 pp on AS09: unregularized estimation injects more noise than signal.

Table 4: NLL ($\downarrow$) averaged over 5 backbones $\times$ 3 seeds ($\pm$ avg seed std). Best per column in bold.

NLL improvements (Table [4](https://arxiv.org/html/2606.14123#S4.T4)) largely mirror AUC: SLC achieves the best NLL on Algebra, AS17, and Eedi. On AS09, SLC incurs only a small NLL cost over Platt (0.596 vs. 0.592) while gaining +2.84 pp AUC. Naive confirms the importance of shrinkage: its AS09 NLL (0.770) is catastrophically worse than that of the uncalibrated backbone (0.628). SLC's ECE is typically higher than Platt's, a structural consequence of per-item logit correction; since ECE is not a proper scoring rule \[[12](https://arxiv.org/html/2606.14123#bib.bib12)\], we rely on NLL to confirm net probabilistic benefit. Density-stratified analysis (Section [4.4](https://arxiv.org/html/2606.14123#S4.SS4)) clarifies this relationship.

### 4.3 The Role of Temporal Tracking

The core ablation compares SLC (static $b_{i}$) against SLC-T ($b_{i} + u_{i}(t)$) under an identical pipeline. Static estimation produces better AUC in all 20 configurations and equal or better NLL in 19 of 20, with average AUC advantage of +0.12 pp (Algebra), +0.16 pp (AS17), +0.07 pp (Eedi), +0.46 pp (AS09).

Table 5: Temporal drift detectability (Proposition [2](https://arxiv.org/html/2606.14123#Thmproposition2)). $\delta_{\min}$: Wald detection threshold at $\alpha = 0.05$; $|\hat{\delta}_{\text{obs}}|$: median observed adjacent-bin logit change (upper bound on true drift). On every dataset, $|\hat{\delta}_{\text{obs}}| \ll \delta_{\min}$.

Table [5](https://arxiv.org/html/2606.14123#S4.T5) applies Proposition [2](https://arxiv.org/html/2606.14123#Thmproposition2) to each dataset: $\delta_{\min}$ exceeds the observed $|\hat{\delta}_{\text{obs}}|$ by 3.5–18$\times$, consistent with the uniform loss of temporal SLC-T to static SLC reported above. Proposition [2](https://arxiv.org/html/2606.14123#Thmproposition2) also gives a rough viability criterion: for a moderate drift of 0.05 logit/bin, temporal tracking would require on the order of $10^{5}$ total obs/item. Offset-Platt also outperforms raw $\sigma(\eta_{0} + \hat{b}_{i})$ on ECE and NLL (AS09: 7.63% vs 10.51%, 0.596 vs 0.621; full breakdown in the appendix). Empirical-Bayes shrinkage (SLC) dominates unsmoothed naive means by +2.0–4.1 pp AUC across datasets.

### 4.4 Where Does Kalman Shrinkage Help?

We stratify items into three density bins by observation count and compute $\Delta$AUC per bin (Table [6](https://arxiv.org/html/2606.14123#S4.T6), Fig. [2](https://arxiv.org/html/2606.14123#S4.F2)).

Table 6: Density-stratified $\Delta$AUC (SLC − ResCal+Iso) by item observation count. SLC's advantage concentrates on sparse items, where Kalman shrinkage prevents noise injection.

![Refer to caption](https://arxiv.org/html/2606.14123v1/x2.png)

Figure 2: Density-stratified $\Delta$ metrics (SLC − ResCal+Iso), averaged over backbones and seeds. On sparse items (Bin 0), SLC achieves simultaneous improvements in AUC, NLL, and ECE; this is the regime where Kalman shrinkage is most valuable.

On AS17 and Eedi, the advantage decreases monotonically with density (+4.1 pp sparse, <0.3 pp dense); on sparse items SLC also improves NLL and ECE simultaneously. On AS09, SLC improves across all bins because even the "dense" bin ($\geq 7$ obs) is globally sparse.

### 4.5 Regime Characterization

Two analyses characterize when SLC helps and when temporal tracking becomes worthwhile.

##### Calibration\-fraction sweep\.

As the calibration fraction increases from 10% to 100%, $\Delta$AUC grows monotonically on both datasets (AS17: +1.24 to +3.68 pp; AS09: +0.16 to +2.84 pp). On AS17, NLL improves in parallel; on AS09, the AUC gain comes with the same small NLL trade-off seen at full calibration fraction (see appendix).

##### Synthetic regime map\.

To disentangle drift intensity from data density, we run a 1PL-IRT simulation ($K = 200$, $T = 20$, 5 seeds) sweeping drift variance $Q$ and observations per item. Figure [3](https://arxiv.org/html/2606.14123#S4.F3)(a) confirms that $\Delta$AUC of static correction scales with $Q$ and is positive whenever $Q > 0$. Panel (b) shows that temporal tracking adds $\leq 0.7$ pp even in the most favorable regime (obs $= 300$, $Q = 0.2$), consistent with Proposition [2](https://arxiv.org/html/2606.14123#Thmproposition2). Panel (c) shows that bias-estimation MSE drops to near zero at $Q = 0$, confirming that the estimator does not inject spurious corrections when no drift is present.
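One way such a simulation could be set up is sketched below. The generative details are assumptions for illustration, since the text specifies only $K$, $T$, the number of seeds, and the swept quantities: abilities and difficulties are drawn from standard normals, each item's logit drift is a Gaussian random walk with per-bin variance $Q$, and the frozen backbone logit ignores the drift. The function name `simulate_1pl` is hypothetical.

```python
import math
import random

def simulate_1pl(n_items=200, n_bins=20, obs_per_item=100, Q=0.05, seed=0):
    """Generate (item, time_bin, eta0, y) rows from a drifting 1PL model.

    Assumed setup (not taken from the paper): eta0 = ability - difficulty
    is what a frozen backbone would predict; the true logit adds a
    per-item random-walk drift that the backbone does not see.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(n_items):
        difficulty = rng.gauss(0.0, 1.0)
        # Random-walk drift in the item's logit, one value per time bin.
        u, drift = 0.0, []
        for _ in range(n_bins):
            u += rng.gauss(0.0, math.sqrt(Q))
            drift.append(u)
        for _ in range(obs_per_item):
            t = rng.randrange(n_bins)
            ability = rng.gauss(0.0, 1.0)
            eta0 = ability - difficulty
            p = 1.0 / (1.0 + math.exp(-(eta0 + drift[t])))
            rows.append((i, t, eta0, 1 if rng.random() < p else 0))
    return rows

# Sweep the two axes of the regime map on a small grid.
for Q in (0.0, 0.05, 0.2):
    for obs in (10, 100, 300):
        data = simulate_1pl(obs_per_item=obs, Q=Q)
        print(Q, obs, len(data))
```

With `Q=0` the drift is identically zero, so the backbone logit is already the true logit; this is the setting in which a well-behaved estimator should return offsets near zero.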

![Refer to caption](https://arxiv.org/html/2606.14123v1/x3.png)

Figure 3: Synthetic regime map (1PL simulation, 5 seeds). (a) $\Delta$AUC of static per-item correction scales with drift intensity $Q$. (b) $\Delta$AUC of temporal over static: even at obs $= 300$ the gain is $< 0.7$ pp. (c) MSE of bias estimates; the $Q = 0$ row confirms unbiased recovery.

### 4.6 Cross-Domain Controls

We apply the same analysis to US DoT flight-delay data \[[30](https://arxiv.org/html/2606.14123#bib.bib30)\] (2018–2019, $\sim$12M flights, $\sim$2500 routes; SGDClassifier backbone without per-route parameters). Route-aware correction recovers about +2 pp AUC while Platt yields $\Delta\mathrm{AUC} \approx 0$; the less-regularized ResCal is slightly stronger in this dense regime (appendix). As a negative control, MovieLens-1M \[[15](https://arxiv.org/html/2606.14123#bib.bib15)\] with MF \[[18](https://arxiv.org/html/2606.14123#bib.bib18)\] / NCF \[[16](https://arxiv.org/html/2606.14123#bib.bib16)\] backbones yields $\Delta\mathrm{AUC} \approx 0$. Together, the controls confirm that stranded headroom is backbone-relative: it appears when the backbone leaves residual entity-level bias and weakens once those effects are already modeled.

## 5 Discussion and Limitations

As a batch method, SLC should be refreshed on roughly the same cadence as global calibration. In extremely sparse settings (AS09: median 3 obs/item), Kalman shrinkage still improves AUC but can incur a small NLL cost vs. Platt (+0.004). The additive per-item form (Proposition [1](https://arxiv.org/html/2606.14123#Thmproposition1)) is MSE-optimal among item-only corrections but cannot capture student $\times$ item interactions; consistent gains across five backbones suggest the item-marginal component dominates. The effect is backbone-relative: SLC adds less in dense regimes (flight-delay) or when item effects are already modeled (MovieLens). Temporal tracking provides no benefit at current densities; Proposition [2](https://arxiv.org/html/2606.14123#Thmproposition2) predicts a viability threshold on the order of $10^{5}$ obs/item. Student- and skill-level random effects and fairness audits are left to future work.

## 6 Conclusion

Global score-only calibration is structurally AUC-invariant; per-item shrinkage correction recovers stranded discrimination that no global calibrator can access. SLC is lightweight ($O(N + KT)$), backbone-agnostic, and concentrates its advantage on sparse items where shrinkage prevents catastrophic noise injection. Temporal tracking is information-limited at current KT densities; Proposition [2](https://arxiv.org/html/2606.14123#Thmproposition2) provides a quantitative viability criterion for future, denser deployments.

#### 6.0.1 Acknowledgements

This work was supported by JST CREST Grant Number JPMJCR22D1, Japan\.

#### 6.0.2 Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article\.

## References

- \[1\] ASSISTments: ASSISTments 2017 data mining dataset. [https://sites.google.com/view/assistmentsdatamining/dataset](https://sites.google.com/view/assistmentsdatamining/dataset) (2017), accessed: 2026-02-23
- \[2\] Breslow, N.E., Clayton, D.G.: Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88(421), 9–25 (1993)
- \[3\] Corbett, A.T., Anderson, J.R.: Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction 4(4), 253–278 (1995)
- \[4\] Durbin, J., Koopman, S.J.: Time Series Analysis by State Space Methods. Oxford University Press, 2nd edn. (2012)
- \[5\] Efron, B.: Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press (2010)
- \[6\]Efron, B\., Morris, C\.: Stein’s estimation rule and its competitors—an empirical Bayes approach\. Journal of the American Statistical Association68\(341\), 117–130 \(1973\)
- \[7\] Fahrmeir, L.: Posterior mode estimation by extended Kalman filtering for multivariate dynamic generalized linear models. Journal of the American Statistical Association 87(418), 501–509 (1992)
- \[8\] Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8), 861–874 (2006)
- \[9\] Feng, M., Heffernan, N.T., Koedinger, K.R.: Addressing the assessment challenge with an online system that tutors as it assesses. User Modeling and User-Adapted Interaction 19, 243–266 (2009)
- \[10\] Frenkel, L., Goldberger, J.: Network calibration by class-based temperature scaling. In: Proceedings of the European Signal Processing Conference (EUSIPCO) (2021)
- \[11\] Ghosh, A., Heffernan, N., Lan, A.S.: Context-aware attentive knowledge tracing. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 2330–2340 (2020)
- \[12\] Gruber, S.G., Buettner, F.: Better uncertainty calibration via proper scores for classification and beyond. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35 (2022)
- \[13\] Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: Proceedings of the 34th International Conference on Machine Learning (ICML). pp. 1321–1330 (2017)
- \[14\] Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (1982)
- \[15\] Harper, F.M., Konstan, J.A.: The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems 5(4), Article 19 (2015)
- \[16\] He, X., Liao, L., Zhang, H., Nie, L., Hu, X., Chua, T.S.: Neural collaborative filtering. In: Proceedings of the 26th International Conference on World Wide Web (WWW). pp. 173–182 (2017)
- \[17\] Kim, Y., Sankaranarayanan, S., Piech, C., Thille, C.: Variational temporal IRT: Fast, accurate, and explainable inference of dynamic learner proficiency. In: Proceedings of the 16th International Conference on Educational Data Mining (EDM) (2023)
- \[18\] Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009)
- \[19\] Lee, M.P., Croteau, E., Gurung, A., Botelho, A.F., Heffernan, N.T.: Knowledge tracing over time: A longitudinal analysis. In: Proceedings of the 16th International Conference on Educational Data Mining (EDM) (2023)
- \[20\] Lee, M.P., Heffernan, N.T.: Concept drift detection for knowledge tracing. In: Proceedings of the 18th International Conference on Educational Data Mining (EDM), Doctoral Consortium (2025)
- \[21\] Menon, A.K., Jayasumana, S., Rawat, A.S., Jain, H., Veit, A., Kumar, S.: Long-tail learning via logit adjustment. In: Proceedings of the 9th International Conference on Learning Representations (ICLR) (2021)
- \[22\] Pan, F., Ao, X., Tang, P., Lu, M., Liu, D., Xiao, L., He, Q.: Field-aware calibration: A simple and empirically strong method for reliable probabilistic predictions. In: Proceedings of The Web Conference (WWW). pp. 729–739 (2020)
- \[23\] Pandey, S., Karypis, G.: A self-attentive model for knowledge tracing. In: Proceedings of the 12th International Conference on Educational Data Mining (EDM) (2019)
- \[24\] Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L.J., Sohl-Dickstein, J.: Deep knowledge tracing. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 28 (2015)
- \[25\] Platt, J.C.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large Margin Classifiers. pp. 61–74. MIT Press (1999)
- \[26\] Shen, S., Liu, Q., Chen, E., Huang, Z., Huang, W., Yin, Y., Su, Y., Wang, S.: Learning process-consistent knowledge tracing. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. pp. 1452–1460 (2021)
- \[27\] Stamper, J.C., Pardos, Z.A.: The 2010 KDD cup competition dataset: Engaging the machine learning community in predictive learning analytics. Journal of Learning Analytics 3(2), 312–316 (2016)
- \[28\] Tomani, C., Cremers, D., Buettner, F.: Parameterized temperature scaling for boosting the expressive power in post-hoc uncertainty calibration. In: Proceedings of the European Conference on Computer Vision (ECCV) (2022)
- \[29\] Tomani, C., Waseda, F.K., Shen, Y., Cremers, D.: Beyond in-domain scenarios: Robust density-aware calibration. In: Proceedings of the 40th International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research, vol. 202, pp. 34344–34368. PMLR (2023)
- \[30\] U.S. Department of Transportation, Bureau of Transportation Statistics: On-time: Reporting carrier on-time performance (1987–present). [https://www.transtats.bts.gov/](https://www.transtats.bts.gov/), accessed: 2026-02-23
- \[31\] Wang, D., Shelhamer, E., Liu, S., Olshausen, B., Darrell, T.: Tent: Fully test-time adaptation by entropy minimization. In: Proceedings of the 9th International Conference on Learning Representations (ICLR) (2021)
- \[32\] Wang, X., Berger, J.O., Burdick, D.S.: Bayesian analysis of dynamic item response models in educational testing. The Annals of Applied Statistics 7(1), 126–153 (2013)
- \[33\] Wang, Z., Lamb, A., Saveliev, E., Cameron, P., Zaykov, Y., Hernández-Lobato, J.M., Turner, R.E., Baraniuk, R.G., Barton, C., Peyton Jones, S., Woodhead, S., Zhang, C.: Results and insights from diagnostic questions: The NeurIPS 2020 education challenge. In: NeurIPS 2020 Competition and Demonstration Track. Proceedings of Machine Learning Research, vol. 133, pp. 191–205. PMLR (2021)
- \[34\] Zadrozny, B., Elkan, C.: Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Proceedings of the 18th International Conference on Machine Learning (ICML). pp. 609–616 (2001)
- \[35\] Zadrozny, B., Elkan, C.: Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 694–699 (2002)
- \[36\] Zhang, J., Shi, X., King, I., Yeung, D.Y.: Dynamic key-value memory networks for knowledge tracing. In: Proceedings of the 26th International Conference on World Wide Web (WWW). pp. 765–774 (2017)

## Appendix

## Appendix 0.A Per-Backbone Results

Tables [7](https://arxiv.org/html/2606.14123#Pt0.A1.T7) and [8](https://arxiv.org/html/2606.14123#Pt0.A1.T8) report AUC and NLL for each backbone individually (mean $\pm$ std across 3 seeds). Per-backbone results show that the AUC gain is broad across architectures, with AKT on Eedi (−0.19 pp) as the only exception; there, AKT's item-attention mechanism already captures most per-item variation, leaving little headroom for post-hoc correction. SAKT consistently shows the largest gains (+2.98 to +6.17 pp), consistent with its weaker per-item modeling capacity.

Table 7: Per-backbone AUC ($\uparrow$), mean $\pm$ std across 3 seeds. Best per column in bold.

Table 8: Per-backbone NLL ($\downarrow$), mean $\pm$ std across 3 seeds. Best per column in bold.

## Appendix 0.B Classic Score-Only Calibrators

Table [9](https://arxiv.org/html/2606.14123#Pt0.A2.T9) makes the score-only comparison visible: temperature scaling matches Platt in AUC to four decimals on all four datasets, isotonic regression changes AUC only negligibly, and histogram binning can lower AUC through ties. These results are averaged over the same five backbones as the main-text tables.

Table 9: Classic score-only calibrators, averaged over 5 backbones at full calibration fraction. Temperature scaling matches Platt in AUC; isotonic regression changes AUC only marginally; histogram binning can reduce AUC through ties.

## Appendix 0.C Sensitivity to Prior Variance $\sigma_{b}^{2}$

The static SLC estimator (Eq. 6 in the main text) uses a fixed prior variance $\sigma_{b}^{2} = 1.0$ for the per-item bias. This appendix reports a controlled sensitivity sweep over $\sigma_{b}^{2} \in \{0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 100.0\}$ and explains why the result is backbone-independent.

We load the AS17 calibration and test splits (188 554 and 188 565 tokens, 3 162 items; median 48 observations per item). A constant backbone $p_{0} = \bar{y}_{\text{calib}} \approx 0.55$ replaces the trained KT model. For each $\sigma_{b}^{2}$ value we: (i) compute the static per-item bias $\hat{b}_{i}$ on the calibration set via Eq. 6, (ii) fit the two-parameter offset-Platt link $(a^{*}, b_{0}^{*})$ by IRLS on the same calibration set, and (iii) evaluate AUC and NLL on the held-out test set.

A constant backbone is sufficient for this sweep because the sensitivity of $\hat{b}_{i}$ to $\sigma_{b}^{2}$ is governed entirely by the shrinkage fraction $\lambda_{i} = W_{i} / (1/\sigma_{b}^{2} + W_{i})$, where $W_{i} = \sum_{n} p_{n}(1 - p_{n})$ is the total Fisher-information weight of item $i$. For a constant backbone, $W_{i} = n_{i} \cdot p_{0}(1 - p_{0}) \approx 0.2475\, n_{i}$; for any trained backbone with predictions spread over $[0, 1]$, the per-token weight $p_{n}(1 - p_{n}) \in [0, 0.25]$ averages to $\approx 0.20$–$0.24$, so $W_{i}$ differs by $< 15\%$. Because $\lambda_{i}$ is monotone in $W_{i}$ and saturates rapidly for $W_{i} \gg 1/\sigma_{b}^{2}$, the sensitivity pattern is backbone-invariant. The constant backbone yields a slightly *higher* $W_{i}$ (since $p(1 - p)$ is maximized at $p = 0.5$), making this a conservative test: a trained backbone would show even less sensitivity.
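To see how quickly the shrinkage fraction saturates, it can be tabulated directly from the formula above. The sketch below assumes the constant backbone $p_{0} = 0.55$ and evaluates $\lambda_{i}$ at three observation counts: 3 and 48 are the AS09 and AS17 medians quoted in the text, while 500 is an arbitrary illustrative value for a dense item. The helper name `shrinkage_fraction` is hypothetical.

```python
def shrinkage_fraction(n_obs, sigma_b2, p0=0.55):
    """lambda_i = W_i / (1/sigma_b2 + W_i), with W_i = n_obs * p0 * (1 - p0)."""
    w = n_obs * p0 * (1.0 - p0)
    return w / (1.0 / sigma_b2 + w)

# Rows: prior variance; columns: items with 3, 48, and 500 observations.
for sigma_b2 in (0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 100.0):
    row = [round(shrinkage_fraction(n, sigma_b2), 3) for n in (3, 48, 500)]
    print(sigma_b2, row)
```

For the 48-observation item, $W_{i} \approx 11.9$, so the fraction stays close to 1 for every $\sigma_{b}^{2} \geq 1$ and only collapses once the prior precision $1/\sigma_{b}^{2}$ becomes comparable to or larger than $W_{i}$.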

Table [10](https://arxiv.org/html/2606.14123#Pt0.A3.T10) reports the sweep. Over the two-order-of-magnitude range $\sigma_{b}^{2} \in [0.1, 10]$, AUC varies by $< 0.4$ pp and NLL by $< 0.004$. Only the extreme $\sigma_{b}^{2} = 0.01$ (prior precision $= 100$, equivalent to heavy shrinkage toward zero) shows a visible AUC loss of $\sim 0.9$ pp, consistent with excessive regularization suppressing genuine item biases.

Table 10: Sensitivity of SLC to prior variance $\sigma_{b}^{2}$ on AS17 (constant backbone). The default $\sigma_{b}^{2} = 1.0$ is shaded.

## Appendix 0.D Connection Between the Static Correction Block and Ridge Logistic Regression

We show that the iterative version of the static per-item correction block is exactly the corresponding IRLS procedure for $\ell_{2}$-penalized logistic regression with per-item intercepts and the backbone logit as a fixed offset. This clarifies the objective behind Eq. 6 in the main text. The deployed SLC in Algorithm 1 uses the associated single-pass blockwise estimate, followed by a separate offset-Platt fit.

### 0.D.1 Setup

Consider $N$ observations with binary labels $y_{n} \in \{0, 1\}$, frozen backbone logits $\eta_{0,n}$, and item assignments $i_{n} \in \{1, \ldots, K\}$. The per-item correction model is $p_{n} = \sigma(\eta_{0,n} + b_{i_{n}})$, where $\sigma$ denotes the sigmoid function. The $\ell_{2}$-penalized negative log-likelihood ("ridge logistic regression") is:

$$\mathcal{L}(\mathbf{b}) = -\sum_{n=1}^{N} \bigl[y_{n} \log p_{n} + (1 - y_{n}) \log(1 - p_{n})\bigr] + \frac{\lambda}{2} \sum_{i=1}^{K} b_{i}^{2}, \qquad \lambda = 1/\sigma_{b}^{2}. \tag{8}$$

### 0.D.2 Hessian is Diagonal

The gradient of $\mathcal{L}$ with respect to $b_{i}$ is:

$$\frac{\partial \mathcal{L}}{\partial b_{i}} = -\sum_{n:\, i_{n} = i} (y_{n} - p_{n}) + \lambda\, b_{i} = -g_{i} + \lambda\, b_{i},$$

where $g_{i} \triangleq \sum_{n:\, i_{n} = i} (y_{n} - p_{n})$ is the per-item score residual.

The Hessian entries are:

$$\frac{\partial^{2} \mathcal{L}}{\partial b_{i}\, \partial b_{j}} = \begin{cases} W_{i} + \lambda & \text{if } i = j, \\ 0 & \text{if } i \neq j, \end{cases}$$

where $W_{i} = \sum_{n:\, i_{n} = i} p_{n}(1 - p_{n})$ is the Fisher information weight. The off-diagonal is zero because each observation $n$ involves exactly one item $i_{n}$: the derivative $\partial p_{n} / \partial b_{j} = 0$ whenever $j \neq i_{n}$. Hence $H = \mathrm{diag}(W_{1} + \lambda, \ldots, W_{K} + \lambda)$.

### 0.D.3 Newton Step Decomposes into $K$ Independent Scalar Updates

The Newton–Raphson update $\mathbf{b}^{(\ell+1)} = \mathbf{b}^{(\ell)} - H^{-1} \nabla \mathcal{L}$ reduces to $K$ independent updates:

$$b_{i}^{(\ell+1)} = b_{i}^{(\ell)} + \frac{g_{i} - \lambda\, b_{i}^{(\ell)}}{W_{i} + \lambda} = \frac{W_{i} \cdot b_{i}^{(\ell)} + g_{i}}{W_{i} + 1/\sigma_{b}^{2}}. \tag{9}$$

### 0.D.4 Comparison with the Iterated Static Correction Block

The iterated static correction block computes:

1. $p_{n} = \sigma(\eta_{0,n} + b_{i_{n}}^{(\ell)})$ for all $n$,
2. $W_{i} = \sum_{n:\, i_{n} = i} p_{n}(1 - p_{n})$, $g_{i} = \sum_{n:\, i_{n} = i} (y_{n} - p_{n})$,
3. $b_{i}^{(\ell+1)} = \dfrac{W_{i} \cdot b_{i}^{(\ell)} + g_{i}}{1/\sigma_{b}^{2} + W_{i}}$.

This is identical to Eq. ([9](https://arxiv.org/html/2606.14123#Pt0.A4.E9)). Given the same initialization $b_{i}^{(0)} = 0$, the iterated blockwise solver and ridge-logistic IRLS produce the same updates at every step and converge to the same fixed point.

> Proposition (Blockwise ridge connection). Under the conditions that (i) the backbone logit $\eta_{0,n}$ is a fixed offset, (ii) the $\ell_{2}$ penalty $\lambda = 1/\sigma_{b}^{2}$ is applied only to the item intercepts $b_{i}$, and (iii) initialization is $b_{i}^{(0)} = 0$, the repeated blockwise updates of the static correction block are identical to the IRLS iterates of the corresponding penalized logistic model. The converged block estimator is the posterior mean of a Gaussian random-intercept model under the Laplace approximation.

The iterated blockwise solver and ridge logistic regression converge to the same static correction solution, but differ in implementation. Standard ridge logistic regression constructs a design matrix of dimension $N \times K$ and solves a $K$-dimensional system at each Newton step, yielding $O(NK)$ cost per iteration. SLC exploits the diagonal Hessian structure by computing per-item bincount aggregates ($O(N)$) followed by $K$ scalar divisions ($O(K)$), for a total cost of $O(N + K)$ per iteration. This decomposition also enables seamless extension to temporal smoothing via the Rauch–Tung–Striebel smoother at cost $O(K \cdot T)$.
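The per-item aggregation described above can be sketched in a few lines. This is a minimal illustration of the update in Eq. (9) under the stated conditions (fixed offset, penalty on the intercepts only, zero initialization), written with plain Python lists rather than array bincounts; the function name `static_item_bias` and its arguments are hypothetical, and it covers only the static block, not the temporal smoother or the offset-Platt link.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def static_item_bias(eta0, item, y, n_items, sigma_b2=1.0, n_iter=1):
    """Blockwise ridge update for per-item intercepts, eta0 as a fixed offset.

    n_iter=1 gives the single-pass estimate from b = 0; larger n_iter
    repeats the update of Eq. (9). Each pass costs O(N + K).
    """
    lam = 1.0 / sigma_b2
    b = [0.0] * n_items
    for _ in range(n_iter):
        W = [0.0] * n_items  # per-item Fisher weights
        g = [0.0] * n_items  # per-item score residuals
        for e, i, t in zip(eta0, item, y):   # O(N) aggregation
            p = sigmoid(e + b[i])
            W[i] += p * (1.0 - p)
            g[i] += t - p
        # O(K) independent scalar updates, one per item.
        b = [(W[i] * b[i] + g[i]) / (lam + W[i]) for i in range(n_items)]
    return b

# One item, four tokens with eta0 = 0: p = 0.5, W = 1.0, g = 1.0.
print(static_item_bias([0.0] * 4, [0] * 4, [1, 1, 1, 0], n_items=1))
```

An item with no calibration tokens has $W_{i} = g_{i} = 0$ and therefore keeps $b_{i} = 0$, which is consistent with the cold-start default of falling back to the global prediction.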

### 0.D.5 Empirical Verification

To confirm the theoretical equivalence, we run a fully converged ridge logistic regression baseline (50 IRLS iterations, same $\lambda = 1/\sigma_{b}^{2}$, followed by the same offset-Platt link) alongside the single-pass SLC on two datasets and two backbones (3 seeds each). Table [11](https://arxiv.org/html/2606.14123#Pt0.A4.T11) reports the difference: the maximum discrepancy is 0.07 pp in AUC and 0.001 in NLL, both well within seed-level variance. The sign of $\Delta$AUC is inconsistent across configurations, confirming that the residual gap is noise rather than a systematic bias from the single-pass approximation.

Table 11: Ridge logistic regression vs. SLC (single-pass). $\Delta = \text{Ridge} - \text{SLC}$, averaged over 3 seeds ($\pm$ std). All differences are within seed-level noise.

If the backbone scale $a$ is jointly optimized with $b_{i}$ (as in standard sklearn LogisticRegression with per-item one-hot features), the Hessian acquires off-diagonal blocks $\partial^{2} \mathcal{L} / \partial a\, \partial b_{i} \neq 0$, and the global Newton step no longer decomposes. SLC therefore separates bias estimation ($b_{i}$ only) from link estimation ($(a, b_{0})$ via offset-Platt), preserving the additive-offset parameterization of Corollary 1 in the main text without claiming a joint Newton step for the full pipeline.
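The off-diagonal coupling can be written out explicitly. As a worked sketch, assume the joint model $p_{n} = \sigma(a\,\eta_{0,n} + b_{i_{n}})$ with the same penalty on $b_{i}$ and no penalty on $a$ (this joint parameterization is used here only for illustration; the global intercept $b_{0}$ is omitted for brevity):

```latex
\frac{\partial \mathcal{L}}{\partial a}
  = -\sum_{n=1}^{N} (y_{n} - p_{n})\,\eta_{0,n},
\qquad
\frac{\partial^{2} \mathcal{L}}{\partial a\,\partial b_{i}}
  = \sum_{n:\, i_{n} = i} p_{n}(1 - p_{n})\,\eta_{0,n},
\qquad
\frac{\partial^{2} \mathcal{L}}{\partial a^{2}}
  = \sum_{n=1}^{N} p_{n}(1 - p_{n})\,\eta_{0,n}^{2}.
```

The mixed derivative vanishes only if the Fisher-weighted mean of $\eta_{0,n}$ over item $i$ is exactly zero, which does not hold in general, so the Hessian takes an arrowhead form (diagonal block plus one dense row and column) rather than a purely diagonal one.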

## Appendix 0.E Link Ablation: Offset-Platt vs. Raw Sigmoid

The main text (Section 4.3) reports that offset-Platt outperforms raw $\sigma(\eta_{0} + \hat{b}_{i})$ on ECE and NLL. Table [12](https://arxiv.org/html/2606.14123#Pt0.A5.T12) provides the full breakdown across datasets and link types.

Table 12: Link ablation on AS09 and AS17 (5 backbones $\times$ 3 seeds, averaged). Raw: $\sigma(\eta_{0} + \hat{b}_{i})$; +OP: offset-Platt $\sigma(a\,\eta_{0} + b_{0} + \hat{b}_{i})$. Static variants use $b_{i}$ only; temporal variants add $u_{i}(t)$.

Adding offset-Platt consistently improves ECE and NLL without sacrificing AUC. On AS09, offset-Platt reduces ECE from 10.51% to 7.63% (−2.88 pp) and NLL from 0.621 to 0.596 (−0.025). On AS17, it reduces ECE from 2.79% to 1.70% (−1.09 pp) and NLL from 0.596 to 0.593 (−0.002). The offset-Platt link correctly treats $\hat{b}_{i}$ as a random effect that should not be rescaled by the global parameter $a$, matching the GLMM parameterization (Corollary 1 in the main text).

## Appendix 0.F Calibration-Fraction Sweep

The main text (Section 4.5) reports that $\Delta$AUC scales monotonically with calibration fraction. Table [13](https://arxiv.org/html/2606.14123#Pt0.A6.T13) provides the detailed results on AS17 and AS09, averaged over 5 backbones $\times$ 3 seeds. All numbers in Table [13](https://arxiv.org/html/2606.14123#Pt0.A6.T13) use the final single-pass SLC pipeline.

Table 13: Calibration-fraction sweep on AS17 and AS09 (5 backbones $\times$ 3 seeds, averaged). SLC's $\Delta$AUC over Platt scales monotonically with calibration data, confirming signal-driven improvement.

On AS17, both $\Delta$AUC and $\Delta$NLL improve monotonically as more calibration data become available, confirming that the gain is signal-driven rather than an artifact of a particular split. On AS09, $\Delta$AUC is also monotone, but NLL remains slightly above Platt throughout; the gap shrinks from +0.012 at 50% calibration data to +0.003 at 100%. This matches the main-text conclusion that in extremely sparse regimes, SLC recovers substantial ranking headroom while retaining a small proper-score trade-off against Platt.

## Appendix 0.G Flight-Delay and MovieLens Experiments

### 0.G.1 Flight-Delay (Positive Control)

The main text (Section 4.6) uses flight-delay as a positive control for the claim that the same phenomenon can arise beyond education when the deployed backbone leaves route-level bias. Table [14](https://arxiv.org/html/2606.14123#Pt0.A7.T14) provides the full results across two backbone variants and all methods.

We use US Department of Transportation on-time performance data from 2018–2019 ($\sim$12M flights, $\sim$2500 routes). We evaluate two backbone variants: bbA (SGDClassifier with carrier/origin/dest features, no per-route parameters) and bbB (the same model family with different regularization). The temporal split uses 2018 for training, early 2019 for calibration, and late 2019 for test, with three random seeds for backbone training.

Table 14: Flight-delay experiment (2 backbones $\times$ 3 seeds, mean $\pm$ std). Route-aware correction recovers AUC headroom that score-only calibration cannot access; in this dense regime, the less-regularized ResCal baseline attains the highest AUC.

Base and Platt produce identical AUC (confirming Lemma 1). Platt-T *decreases* AUC ($-0.8\,\text{pp}$ on bbA), showing that time-only conditioning without route identity is harmful. Per-route methods (ResCal, SLC) recover $+2$–$3\,\text{pp}$ AUC headroom, so the positive-control conclusion is structural rather than method-specific. ResCal achieves the highest AUC, while SLC is slightly more conservative in this dense regime; this matches the main-text density analysis, where shrinkage helps most when per-item calibration data are scarce and can add bias when route-level data are abundant. SLC+Iso achieves the best NLL (0.467–0.468) while preserving the route-aware AUC gain.
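The contrast between score-only and route-aware correction can be illustrated with a small synthetic sketch. It is not the paper's experiment: the data, the Platt parameters `0.7` and `-0.4`, and the use of oracle route offsets are assumptions. A strictly increasing transform of the backbone score leaves every pairwise ordering, and therefore AUC, unchanged, whereas adding a route-dependent offset reorders examples across routes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic, using average ranks for ties."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    _, inv, counts = np.unique(scores, return_inverse=True, return_counts=True)
    cum = np.cumsum(counts)
    avg_rank = cum - (counts - 1) / 2.0   # average 1-based rank of each tie group
    ranks = avg_rank[inv]
    n_pos = labels.sum()
    n_neg = labels.size - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))

# Synthetic data: the true logit has a route-level bias the backbone ignores.
rng = np.random.default_rng(0)
n, n_routes = 50000, 200
route = rng.integers(0, n_routes, size=n)
route_bias = rng.normal(0.0, 0.7, size=n_routes)
eta0 = rng.normal(0.0, 1.0, size=n)                 # backbone logit, no route term
y = rng.random(n) < sigmoid(eta0 + route_bias[route])

auc_base = auc(eta0, y)
auc_platt = auc(sigmoid(0.7 * eta0 - 0.4), y)       # monotone, score-only transform
auc_route = auc(eta0 + route_bias[route], y)        # oracle route-aware offset

print(f"Base {auc_base:.4f} | Platt {auc_platt:.4f} | route-aware {auc_route:.4f}")
```

The oracle offset here is an upper-bound illustration. In practice the per-route offsets would have to be estimated from calibration data, which is where the ResCal and SLC methods compared in Table 14 differ.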

### 0.G.2 MovieLens-1M (Negative Control)

The main text reports that MovieLens-1M with a matrix-factorization backbone yields $\Delta\operatorname{AUC} \approx 0$. Table [15](https://arxiv.org/html/2606.14123#Pt0.A7.T15) confirms this.

We use MovieLens-1M ($\sim$1M ratings, 3706 movies, 6040 users) as a binary classification task, with ratings $\geq 4$ treated as positive. The two backbones are MF (matrix factorization with per-item embeddings) and NCF (neural collaborative filtering with per-item embeddings). We use a temporal split by timestamp and report means over 3 seeds.
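The preprocessing described above can be sketched as follows. This is a hedged illustration on toy arrays, not the authors' pipeline. The helper names `binarize_ratings` and `temporal_split` are hypothetical, and the split fractions are placeholders because the paper does not state them here.

```python
import numpy as np

def binarize_ratings(ratings, threshold=4):
    """Label a rating as positive (1) when it is >= threshold, else 0."""
    return (np.asarray(ratings) >= threshold).astype(int)

def temporal_split(timestamps, fractions=(0.6, 0.2, 0.2)):
    """Return (train, calib, test) index arrays in chronological order.

    The fractions are illustrative assumptions, not values from the paper.
    """
    timestamps = np.asarray(timestamps)
    order = np.argsort(timestamps, kind="stable")   # oldest interaction first
    n = len(order)
    n_train = int(round(fractions[0] * n))
    n_calib = int(round(fractions[1] * n))
    train = order[:n_train]
    calib = order[n_train:n_train + n_calib]
    test = order[n_train + n_calib:]
    return train, calib, test

# Toy interactions (assumed data).
ratings = np.array([5, 3, 4, 1, 2, 4, 5, 3, 4, 2])
timestamps = np.array([50, 10, 90, 30, 70, 20, 100, 60, 40, 80])

labels = binarize_ratings(ratings)
train_idx, calib_idx, test_idx = temporal_split(timestamps)
print(labels, train_idx, calib_idx, test_idx)
```

Sorting by timestamp before slicing keeps every calibration interaction later than all training interactions, and every test interaction later than all calibration interactions.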

Table 15: MovieLens-1M negative control (3 seeds, mean). When the backbone already models per-item effects, post-hoc per-item correction provides negligible or no benefit.

On the MF backbone, $\Delta$AUC between Base and SLC is $+0.02\,\text{pp}$, effectively zero. On the NCF backbone, SLC *decreases* AUC by $-0.42\,\text{pp}$, and Naive degrades by $-3.13\,\text{pp}$. Together with flight-delay, this indicates that the effect is backbone-relative rather than domain-specific: when the backbone already incorporates item effects via embeddings, post-hoc per-item correction adds noise without recovering headroom. The MovieLens result validates SLC’s applicability boundary and demonstrates that the method is appropriately conservative (does not inflate metrics artificially).
