Embedding Foundation Model Predictions in Discrete-Choice Models with Structural Guarantees
Summary
This paper proposes a two-stage adapter that embeds foundation model predictions into a multinomial logit model, preserving economic properties like cost monotonicity and interpretable willingness-to-pay while improving accuracy by up to 12.8 percentage points.
# Embedding Foundation Model Predictions in Discrete-Choice Models with Structural Guarantees
Source: [https://arxiv.org/html/2606.26432](https://arxiv.org/html/2606.26432)
Yingshuo Wang¹, Xian Sun², Yanhang Li³, Zhichao Fan⁴, Zexin Zhuang⁵

¹University of California, Berkeley, CA, USA. ²Duke University, Durham, NC, USA. ³Northeastern University, Boston, MA, USA. ⁴University of Illinois Urbana-Champaign, IL, USA. ⁵Southern Methodist University, Dallas, TX, USA. Correspondence to: Yingshuo Wang <yingshuow@berkeley.edu>.
###### Abstract
Tabular foundation models achieve strong accuracy on choice prediction tasks, but their predictions often violate the economic logic those tasks require: raising a price can increase predicted demand, implied willingness-to-pay estimates are frequently negative or implausible, and unavailable alternatives receive nonzero probability. We propose a two-stage adapter that takes a foundation model’s predicted choice probabilities as a precomputed feature and embeds them inside a multinomial logit’s utility. In Stage 1, we fit the multinomial logit’s structural coefficients by maximum likelihood with sign constraints; in Stage 2, we freeze those coefficients and fit a small neural correction operating on the foundation model’s predictions. We prove that this composition exactly preserves the multinomial logit’s marginal rate of substitution, so analytically computable value-of-time becomes a mathematical guarantee rather than an empirical accident. Across three datasets and two foundation models, the adapter gains 6.4 percentage points (pp) of test accuracy on average over the multinomial logit and up to 12.8 pp, maintains 100% cost monotonicity, and produces values of time within the published transportation-economics range on the transportation datasets. Performance degrades gracefully under foundation-model context restriction, retaining at least 6 pp of accuracy gain even at 10% of the original foundation-model context.
## 1 Introduction
Discrete-choice models guide policy decisions with significant economic stakes: the per-minute value of time from commuter mode-choice forecasts (Ben-Akiva and Lerman, [1985](https://arxiv.org/html/2606.26432#bib.bib2); Train, [2009](https://arxiv.org/html/2606.26432#bib.bib3)) anchors cost-benefit appraisal of multi-billion-dollar rail and road investments, and willingness-to-pay estimates from consumer discrete-choice experiments set prices and predict how labeling regimes shift behavior. A choice model must do two things: forecast which alternative is chosen, and forecast how the choice responds to intervention.
Multinomial-logit utility models (Ben-Akiva and Lerman, [1985](https://arxiv.org/html/2606.26432#bib.bib2); Train, [2009](https://arxiv.org/html/2606.26432#bib.bib3)) satisfy the second requirement by construction: a sign-constrained cost coefficient gives monotone demand, the $\beta_{\text{time}}/\beta_{\text{cost}}$ ratio gives interpretable willingness-to-pay, and unavailable alternatives receive zero probability. They satisfy the first only modestly. Modern machine-learning models invert the trade-off, raising accuracy at the cost of breaking each structural property in turn (Hillel et al., [2021](https://arxiv.org/html/2606.26432#bib.bib6); van Cranenburgh et al., [2022](https://arxiv.org/html/2606.26432#bib.bib7); Zhao et al., [2020](https://arxiv.org/html/2606.26432#bib.bib8)).
Tabular foundation models such as TabPFN (Hollmann et al., [2023](https://arxiv.org/html/2606.26432#bib.bib11), [2025](https://arxiv.org/html/2606.26432#bib.bib12)) and Mitra (Maddix Robinson et al., [2025](https://arxiv.org/html/2606.26432#bib.bib13)) intensify the tension by raising the accuracy ceiling while inheriting the same structural failures. Existing remedies each carry costs: architecturally constrained monotonic neural networks (Sill, [1997](https://arxiv.org/html/2606.26432#bib.bib22); Wehenkel and Louppe, [2019](https://arxiv.org/html/2606.26432#bib.bib23); Sartor et al., [2025](https://arxiv.org/html/2606.26432#bib.bib19)) eliminate monotonicity violations but lose the closed-form trade-off ratio; knowledge distillation (Hinton et al., [2015](https://arxiv.org/html/2606.26432#bib.bib17)) into a multinomial-logit student preserves the structural guarantees but cannot match the teacher’s accuracy; penalized fine-tuning needs gradient access through the foundation model’s weights, which is incompatible with the in-context learning that TabPFN and Mitra use at inference.
We propose a two-stage adapter that embeds the foundation model’s predicted probability vector inside a constrained multinomial logit as a precomputed feature. In Stage 1, we fit the structural coefficients by maximum likelihood under the usual MNL constraints with the correction held at zero; in Stage 2, we freeze those coefficients and fit a small neural correction. We prove that this two-stage procedure preserves the structural marginal rate of substitution exactly, and that joint training of the structural coefficients and the correction does not: correction expressivity creates a one-parameter family of likelihood-equivalent solutions, breaking identifiability.
Our contributions are:
- Adapter. A two-stage behavioral adapter that preserves the structural multinomial logit’s economic guarantees by construction while recovering most of a foundation model’s accuracy advantage.
- Theory. Propositions [1](https://arxiv.org/html/2606.26432#Thmproposition1) and [2](https://arxiv.org/html/2606.26432#Thmproposition2) identify two-stage training as the design choice preserving structural identifiability and characterize the failure mode of joint training.
- Evaluation. Three discrete-choice datasets, two foundation models, six ablations, a feature-augmented multinomial-logit baseline, calibration analysis with three post-hoc methods, and a counterfactual aggregate-share evaluation. The accuracy gain is positive in 10 of 10 bootstrap replicates on every (dataset, foundation-model) cell, with $p\approx 0.002$ under the exact binomial sign test and McNemar paired-observation $p$-values below $10^{-20}$ on the larger datasets.
- Audit pipeline. A behavioral-audit recipe applicable to any discrete-choice dataset and any predict-function, generalizing value-of-time auditing to willingness-to-pay for arbitrary non-cost attributes and to cluster-aware aggregation for panel data.
## 2 Related work
#### Machine learning and economic consistency in choice modeling.
Discrete-choice modeling has been the standard tool in transportation economics and consumer behavior since Ben-Akiva and Lerman ([1985](https://arxiv.org/html/2606.26432#bib.bib2)). Recent work pushes machine learning into the field while exposing a tension with economic structure: Hillel et al. ([2021](https://arxiv.org/html/2606.26432#bib.bib6)) document consistent accuracy gains across neural and ensemble methods over the multinomial logit but few behavioral diagnostics; Zhao et al. ([2020](https://arxiv.org/html/2606.26432#bib.bib8)) report frequent monotonicity violations and implausible willingness-to-pay estimates from machine-learning choice models; and van Cranenburgh et al. ([2022](https://arxiv.org/html/2606.26432#bib.bib7)) frame economic consistency as an open challenge distinct from prediction accuracy. Han et al. ([2022](https://arxiv.org/html/2606.26432#bib.bib9)) (TasteNet) propose a neural-embedded discrete-choice model that learns taste parameters as neural-network functions of individual characteristics, targeting taste-parameter interpretability through learned heterogeneity rather than the structural-plus-correction decomposition we adopt here. Relative to this line of work, ours additionally derives a formal preservation result, evaluates across foundation models, and characterizes joint training’s failure mode. This work builds on our workshop paper (Wang et al., [2026](https://arxiv.org/html/2606.26432#bib.bib1)); the present paper extends it with an additional dataset, formal preservation results, and counterfactual aggregate-share evaluation.
#### Architecturally constrained monotonic neural networks.
Architectural approaches enforce monotonicity through sign-constrained weights and monotone activations (Sill, [1997](https://arxiv.org/html/2606.26432#bib.bib22); Wehenkel and Louppe, [2019](https://arxiv.org/html/2606.26432#bib.bib23); Sartor et al., [2025](https://arxiv.org/html/2606.26432#bib.bib19)), eliminating cost-monotonicity violations but losing the closed-form trade-off ratio: with no global coefficients, the value-of-time analogue must be computed from per-observation gradients rather than read off as $\beta_{\text{time}}/\beta_{\text{cost}}$. Our adapter takes a different approach: monotonicity and trade-off ratios come from the structural component, whose constrained parameterization is preserved throughout training, while the foundation model contributes a non-differentiable side channel.
#### Tabular foundation models and knowledge distillation.
Tabular foundation models such as TabPFN (Hollmann et al., [2023](https://arxiv.org/html/2606.26432#bib.bib11), [2025](https://arxiv.org/html/2606.26432#bib.bib12)) and Mitra (Maddix Robinson et al., [2025](https://arxiv.org/html/2606.26432#bib.bib13)) achieve strong classification accuracy via in-context learning. Their published benchmarks emphasize predictive accuracy; structural-validity diagnostics specific to choice modeling (cost-monotone responses, finite positive trade-off ratios, zero probability on unavailable alternatives) at test-row coverage are not part of these benchmarks. Our audit fills this gap. Knowledge distillation (Hinton et al., [2015](https://arxiv.org/html/2606.26432#bib.bib17)) into a multinomial-logit student preserves structural guarantees but is bounded by the student’s expressive capacity: a plain MNL student parameterizes a strictly linear-in-features utility, and the residual variance the foundation model captures lives above that ceiling. We position the foundation model differently: its predictions become an explanatory feature embedded inside a structurally constrained utility, with the foundation model’s parameters never modified.
## 3 Method
### 3.1 Setup and notation
We consider a discrete-choice setting with $N$ observations indexed by $i\in\{1,\dots,N\}$. Each observation has a feature vector $\mathbf{x}_i\in\mathcal{X}$, a set of available alternatives $\mathcal{K}_i\subseteq\{1,\dots,K\}$, and an observed choice $y_i\in\mathcal{K}_i$. For panel data we additionally observe a subject identifier $s_i$, since multiple observations from the same subject are not exchangeable; we cluster on $s_i$ throughout.
A choice model produces probabilities $P_k(\mathbf{x}_i)\in[0,1]$ summing to one over $\mathcal{K}_i$, i.e., a vector on the $(K-1)$-simplex $\Delta^{K-1}$. The standard multinomial logit (MNL) parameterizes these through a linear utility $V_k(\mathbf{x}_i)=\boldsymbol{\beta}^{\top}\boldsymbol{\phi}_k(\mathbf{x}_i)$ and the softmax $P_k=\exp V_k/\sum_{j\in\mathcal{K}_i}\exp V_j$ (Ben-Akiva and Lerman, [1985](https://arxiv.org/html/2606.26432#bib.bib2)). The alternative-specific feature transform $\boldsymbol{\phi}_k$ selects from $\mathbf{x}_i$ the columns that enter alternative $k$’s utility, typically including cost, time, alternative-specific constants, and sociodemographic interactions. We treat $\boldsymbol{\phi}_k$ as fixed by the dataset’s specification and recover $\boldsymbol{\beta}$ by maximum likelihood.
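The MNL probabilities defined above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors’ implementation: the array layout (`phi` with shape `(N, K, D)`), the function name `mnl_probabilities`, and the toy values are assumptions made for the example. The softmax runs over the available set only, which is what makes unavailable alternatives receive exactly zero probability.

```python
import numpy as np

def mnl_probabilities(phi, beta, avail):
    """phi: (N, K, D) alternative-specific features; beta: (D,); avail: (N, K) bool mask."""
    v = phi @ beta                            # linear utilities V_k, shape (N, K)
    v = np.where(avail, v, -np.inf)           # unavailable alternatives contribute exp(-inf) = 0
    v = v - v.max(axis=1, keepdims=True)      # subtract the row maximum for numerical stability
    expv = np.exp(v)
    return expv / expv.sum(axis=1, keepdims=True)

# Toy data (hypothetical): 4 observations, 3 alternatives, 2 features (cost, time).
rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3, 2))
beta = np.array([-0.8, -0.3])                 # negative cost and time coefficients
avail = np.ones((4, 3), dtype=bool)
avail[0, 2] = False                           # alternative 3 is not offered to observation 1
P = mnl_probabilities(phi, beta, avail)
```

Each row of `P` sums to one over the available alternatives, and `P[0, 2]` is exactly zero. The sketch assumes every observation has at least one available alternative.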
### 3.2 Behavioral audit
Every model is evaluated through three model-agnostic functionals that take only a predict function $\mathbf{x}\mapsto P(\mathbf{x})\in\Delta^{K-1}$ as input. We say model-agnostic in the sense that the audit reads model outputs only, requiring no gradient access or knowledge of internals.
Intervention protocol. For the adapter and the feature-augmented MNL we use the *fixed-$\mathbf{q}$* protocol throughout: when a cost or attribute is perturbed at row $i$, the foundation-model probability vector $\mathbf{q}_i$ is held fixed at the value computed on the unperturbed $\mathbf{x}_i$, and the perturbation enters only $V^{\text{struct}}_k$. The alternative *recomputed-$\mathbf{q}$* protocol re-runs the foundation model on perturbed inputs; we use it for the raw foundation-model counterfactual evaluation (Section [5.3](https://arxiv.org/html/2606.26432#S5.SS3)) only, and forfeit the structural guarantees in that case.
The three functionals are: (1) *Monotonicity.* For each test row $i$ and alternative $k$, perturb $k$’s cost upward by $1\%$ of its observed range and check whether $P_k$ falls; we report the observation-level rate (cluster-aware on panel data). (2) *Trade-off ratio.* The marginal rate of substitution between non-cost attribute $a$ and cost $b$, reported in the standard transportation-economics sign convention so that value of time and willingness to pay for utility-improving attributes are positive: $\rho^{\mathrm{VOT}}_{a,b}=(\partial P_k/\partial a)/(\partial P_k/\partial b)$ when both partials are negative (e.g., time and cost both lower utility), and $\rho^{\mathrm{WTP}}_{a,b}=-(\partial P_k/\partial a)/(\partial P_k/\partial b)$ when $\partial P_k/\partial a$ is positive and $\partial P_k/\partial b$ is negative (a desirable non-cost attribute vs. cost). Both partials are estimated by finite differences with a step of $1\%$ of each column’s observed range. (3) *Availability compliance.* For datasets in which the available set $\mathcal{K}_i$ varies across observations (e.g., Swissmetro, where the proposed Swissmetro option is not offered to some respondents), $\mathrm{Leak}(M)=\mathbb{E}_i\big[\sum_{k\notin\mathcal{K}_i}P_k(\mathbf{x}_i)\big]$ measures the predicted probability assigned to formally unavailable alternatives. The multinomial logit applies softmax over $\mathcal{K}_i$ only, so its leak is mechanically zero; black-box predictors that ignore the availability mask can leak nonzero probability onto unavailable alternatives.
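The monotonicity and availability functionals need nothing beyond a predict function, which the following sketch makes concrete. It is an illustration under stated assumptions rather than the paper’s audit code: the names `monotonicity_rate` and `availability_leak`, the convention that `cost_cols[k]` is the column holding alternative `k`’s cost, and the two toy predictors are all hypothetical, and the cluster-aware aggregation used on panel data is omitted.

```python
import numpy as np

def monotonicity_rate(predict, X, cost_cols, step_frac=0.01):
    """Share of (row, alternative) pairs where raising alternative k's cost lowers P_k."""
    base = predict(X)
    ok = []
    for k, col in enumerate(cost_cols):
        Xp = X.copy()
        Xp[:, col] += step_frac * (X[:, col].max() - X[:, col].min())  # +1% of observed range
        ok.append(predict(Xp)[:, k] < base[:, k])
    return float(np.mean(ok))

def availability_leak(P, avail):
    """Mean probability mass placed on alternatives outside the available set."""
    return float(np.mean(np.sum(np.where(avail, 0.0, P), axis=1)))

def softmax(v):
    e = np.exp(v - v.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy predictors (hypothetical): three alternatives whose only feature is their own cost.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 5.0, size=(50, 3))
well_behaved = lambda X: softmax(-X)      # demand falls as cost rises
perverse = lambda X: softmax(X)           # demand rises as cost rises

rate_good = monotonicity_rate(well_behaved, X, cost_cols=[0, 1, 2])
rate_bad = monotonicity_rate(perverse, X, cost_cols=[0, 1, 2])

P_uniform = np.full((2, 3), 1.0 / 3.0)
avail = np.array([[True, True, False], [True, True, True]])
leak = availability_leak(P_uniform, avail)
```

Because both functions see only model outputs, the same calls apply unchanged to a logit, a neural network, or a foundation model wrapped in a `predict` callable.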
### 3.3 Two-stage behavioral adapter
Architecture. For each observation $i$, alternative $k$, and a precomputed foundation-model probability vector $\mathbf{q}(\mathbf{x}_i)\in\Delta^{K-1}$ obtained by a single forward pass through the foundation model on the raw input, the adapter’s utility is

$$V_k(\mathbf{x}_i)=\underbrace{\boldsymbol{\beta}^{\top}\boldsymbol{\phi}_k(\mathbf{x}_i)}_{V_k^{\text{struct}}\text{ (economic structure)}}+\underbrace{g_k(\mathbf{q}(\mathbf{x}_i))}_{\text{foundation-model correction}},\tag{1}$$

where $g\colon\Delta^{K-1}\to\mathbb{R}^{K}$ is a small MLP (two hidden layers, width $32$). To enforce cost/time monotonicity throughout training we reparameterize each cost and time coefficient as $\beta=-\exp(\theta)$.
Two-stage training. We fit the model in two sequential stages. *Stage 1*: with the correction held at $g\equiv 0$, fit $\boldsymbol{\beta}$ by maximum likelihood under the sign constraints, recovering the standalone-MNL estimate $\boldsymbol{\beta}^{\ast}$. *Stage 2*: fix $\boldsymbol{\beta}=\boldsymbol{\beta}^{\ast}$ and fit only $g$ by maximum likelihood. The architecture realizes the Stage 1 condition $g\equiv 0$ exactly at initialization because $g$’s output layer is zero-initialized; hidden weights use He initialization (He et al., [2015](https://arxiv.org/html/2606.26432#bib.bib25)). Because $\mathbf{q}(\mathbf{x}_i)$ is precomputed and held fixed, $\partial\mathbf{q}/\partial\mathbf{x}_i=0$, so monotonicity and trade-off ratios reduce to functions of $\boldsymbol{\beta}^{\ast}$ alone. Propositions [1](https://arxiv.org/html/2606.26432#Thmproposition1)–[2](https://arxiv.org/html/2606.26432#Thmproposition2) make this precise and identify joint training as the failure mode.
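A forward-pass sketch of the adapter utility helps show why a zero-initialized output layer makes the adapter coincide with the Stage 1 logit before any Stage 2 update. The code below covers the forward pass only, not the training loop, and several details are assumptions: the ReLU activation, the random toy inputs, and the `theta` values are illustrative, since the text specifies only two hidden layers of width 32, He-initialized hidden weights, and a zero-initialized output layer.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D, H = 5, 3, 2, 32                      # rows, alternatives, structural features, hidden width

def softmax(v):
    e = np.exp(v - v.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Sign-constrained structural coefficients: beta = -exp(theta) is negative for any theta.
theta = np.array([0.1, -0.5])                 # hypothetical unconstrained parameters
beta = -np.exp(theta)

# Correction network g: He-initialized hidden layers, zero-initialized output layer.
W1 = rng.normal(0.0, np.sqrt(2.0 / K), size=(K, H)); b1 = np.zeros(H)
W2 = rng.normal(0.0, np.sqrt(2.0 / H), size=(H, H)); b2 = np.zeros(H)
W3 = np.zeros((H, K)); b3 = np.zeros(K)

def g(q):
    h = np.maximum(q @ W1 + b1, 0.0)          # ReLU is an assumption; the text does not name it
    h = np.maximum(h @ W2 + b2, 0.0)
    return h @ W3 + b3

phi = rng.uniform(size=(N, K, D))             # toy structural features (cost, time)
q = softmax(rng.normal(size=(N, K)))          # stand-in for precomputed foundation-model output

V_struct = phi @ beta
P_mnl = softmax(V_struct)                     # Stage 1 model
P_adapter = softmax(V_struct + g(q))          # eq. (1) at the start of Stage 2
```

At this point `P_adapter` equals `P_mnl`; Stage 2 would then update only `W1`, `W2`, `W3` and the biases while `beta` stays frozen.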
### 3.4 Propositions
We use $\boldsymbol{\beta}^{\ast}$ for the Stage 1 maximum-likelihood estimate and reuse the notation of ([1](https://arxiv.org/html/2606.26432#S3.E1)).
###### Proposition 1 (Marginal-rate-of-substitution preservation under two-stage training, fixed-$\mathbf{q}$ protocol).
Let $\boldsymbol{\beta}^{\ast}$ be the Stage 1 maximum-likelihood estimate, let $g$ be any Stage 2 parameters in ([1](https://arxiv.org/html/2606.26432#S3.E1)), and operate under the fixed-$\mathbf{q}$ protocol of Section [3.2](https://arxiv.org/html/2606.26432#S3.SS2). For any two attributes $j,j'$ that enter the model only through $V^{\text{struct}}_k(\mathbf{x}_i)=\boldsymbol{\beta}^{\top}\boldsymbol{\phi}_k(\mathbf{x}_i)$, and for any $\mathbf{x}_i$ at which $\boldsymbol{\phi}_k$ is differentiable in $x_{ij}$ and $x_{ij'}$,

$$\mathrm{MRS}_{j,j'}(\mathbf{x}_i)\;\equiv\;\frac{\partial V_k(\mathbf{x}_i)/\partial x_{ij}}{\partial V_k(\mathbf{x}_i)/\partial x_{ij'}}\;=\;\frac{\beta^{\ast}_{j}}{\beta^{\ast}_{j'}}.$$

In particular, the value-of-time analogue $\beta^{\ast}_{\text{time}}/\beta^{\ast}_{\text{cost}}$ is identical to the standalone multinomial logit’s value-of-time and is invariant to the choice of foundation model and to the choice of $g$. Under the recomputed-$\mathbf{q}$ protocol, $g$ contributes a chain-rule term through $\mathbf{q}$ to each partial and the ratio no longer reduces to $\beta^{\ast}_{j}/\beta^{\ast}_{j'}$ in general.
- Corollary (probability-derivative MRS). Under the same fixed-$\mathbf{q}$ protocol, restrict attention to attributes $j,j'$ that enter only alternative $k$’s utility. Then the audit estimator equals the structural coefficient ratio under the audit’s sign convention: $\rho^{\mathrm{VOT}}_{j,j'}(\mathbf{x}_i)=\beta^{\ast}_{j}/\beta^{\ast}_{j'}$ and $\rho^{\mathrm{WTP}}_{j,j'}(\mathbf{x}_i)=-\beta^{\ast}_{j}/\beta^{\ast}_{j'}$, since the softmax derivative’s $P_k(1-P_k)$ factor cancels in the ratio (Appendix [A.1](https://arxiv.org/html/2606.26432#A1.SS1)).
- Intuition. Fixed $\mathbf{q}$ forces $\partial\mathbf{q}/\partial\mathbf{x}=0$, so $g$ vanishes from $\partial V_k/\partial x_{ij}$ and the ratio collapses to $\beta^{\ast}_{j}/\beta^{\ast}_{j'}$. Full proof in Appendix [A.1](https://arxiv.org/html/2606.26432#A1.SS1).
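The corollary can be checked numerically on a toy example: with the correction output held constant (the fixed-$\mathbf{q}$ protocol), the ratio of probability derivatives should return $\beta^{\ast}_{\text{time}}/\beta^{\ast}_{\text{cost}}$ whatever values the correction takes. The sketch below is an illustration under assumptions: the coefficient values, attribute levels, and the name `audit_vot` are hypothetical, and it uses central differences with a small step rather than the 1%-of-range step of the audit.

```python
import numpy as np

beta_time, beta_cost = -0.04, -0.01           # hypothetical Stage 1 estimates (ratio = 4)
time = np.array([30.0, 45.0, 20.0])           # one entry per alternative
cost = np.array([5.0, 2.0, 8.0])

def prob(time, cost, g_of_q, k):
    v = beta_time * time + beta_cost * cost + g_of_q   # eq. (1) with the correction held fixed
    e = np.exp(v - v.max())
    return (e / e.sum())[k]

def audit_vot(g_of_q, k=0, h=1e-4):
    """Ratio of central-difference partials of P_k w.r.t. alternative k's time and cost."""
    e_k = np.eye(len(time))[k]                          # perturb alternative k only
    d_time = (prob(time + h * e_k, cost, g_of_q, k)
              - prob(time - h * e_k, cost, g_of_q, k)) / (2 * h)
    d_cost = (prob(time, cost + h * e_k, g_of_q, k)
              - prob(time, cost - h * e_k, g_of_q, k)) / (2 * h)
    return d_time / d_cost

vot_with_correction = audit_vot(np.array([0.7, -0.2, 0.1]))   # arbitrary correction values
vot_plain_mnl = audit_vot(np.zeros(3))                        # g = 0, i.e. the Stage 1 logit
```

Both values agree with the coefficient ratio of 4 up to finite-difference error, because the common factor $P_k(1-P_k)$ cancels and the correction never enters either derivative.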
###### Proposition 2 (Joint training breaks structural identifiability).
Assume there exists a continuous $\kappa_k\colon\Delta^{K-1}\to\mathbb{R}$ such that $\mathrm{cost}_k(\mathbf{x})=\kappa_k(\mathbf{q}(\mathbf{x}))$ on the closure of the training support (*cost-recoverability assumption*); the correction class $\mathcal{G}$ is dense in $C^{0}(\Delta^{K-1},\mathbb{R}^{K})$ (Cybenko, [1989](https://arxiv.org/html/2606.26432#bib.bib27)); and $L(\boldsymbol{\beta},g)$ is minimized jointly without two-stage constraints. Then for any joint minimizer $(\boldsymbol{\beta}^{(0)},g^{(0)})$ there exists a one-parameter family $\{(\boldsymbol{\beta}^{(c)},g^{(c)}):c\in\mathcal{C}\}$ of distinct configurations achieving identical loss, parametrized by $\beta^{(c)}_{\text{cost}}=\beta^{(0)}_{\text{cost}}+c$ for $c$ in an open interval $\mathcal{C}$ preserving sign constraints on $\beta_{\text{cost}}$. Gradient descent selects an initialization-dependent member of this family rather than the MNL MLE $\boldsymbol{\beta}^{\ast}$.
- Note (cost-recoverability). The assumption is dataset-dependent; partial-recoverability cases appear in Appendix [B.1](https://arxiv.org/html/2606.26432#A2.SS1).
- Intuition. Subtracting $c\cdot\kappa_k(\mathbf{q})$ from $g_k$ shifts $\beta_{\text{cost}}$ by $+c$ without changing pointwise utility: the structural increase $c\cdot\mathrm{cost}_k$ in $V_k$ is cancelled by the correction-side decrease. Ablation A3 (§[6](https://arxiv.org/html/2606.26432#S6)) is the empirical illustration; full proof in Appendix [A.2](https://arxiv.org/html/2606.26432#A1.SS2).
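The cancellation behind Proposition 2 is easy to verify numerically. In the sketch below, which is a toy illustration and not the paper’s ablation A3, cost-recoverability is imposed directly by setting the stand-in `kappa_of_q` equal to cost; the arrays `other` and `g0` are hypothetical placeholders for the remaining structural utility and for $g^{(0)}$ evaluated at each row.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 3
cost = rng.uniform(1.0, 10.0, size=(N, K))
other = rng.normal(size=(N, K))      # stand-in for the non-cost part of the structural utility
g0 = rng.normal(size=(N, K))         # stand-in for g^(0)_k(q(x_i))
kappa_of_q = cost                    # cost-recoverability: kappa_k(q(x)) = cost_k(x)
beta_cost0 = -0.5                    # hypothetical joint-training estimate

def utility(beta_cost, g):
    return beta_cost * cost + other + g

shifts = (-0.3, 0.2, 0.4)            # values of c that keep beta_cost negative
max_gap = 0.0
shifted_betas = []
for c in shifts:
    beta_c = beta_cost0 + c          # shifted cost coefficient beta^(c)
    g_c = g0 - c * kappa_of_q        # compensating correction g^(c)
    shifted_betas.append(beta_c)
    gap = np.abs(utility(beta_c, g_c) - utility(beta_cost0, g0)).max()
    max_gap = max(max_gap, gap)
```

`max_gap` stays at floating-point noise while the cost coefficient ranges from -0.8 to -0.1, so the likelihood cannot distinguish these configurations even though each implies a different trade-off ratio.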
## 4 Experimental setup
#### Datasets.
We evaluate on three discrete-choice datasets. Swissmetro (Bierlaire et al., [2001](https://arxiv.org/html/2606.26432#bib.bib4)): $10{,}719$ stated-preference commuter choices among rail, the proposed Swissmetro, and car. LPMC (Hillel et al., [2018](https://arxiv.org/html/2606.26432#bib.bib5)): $81{,}086$ revealed-preference London trips among walk, cycle, public transport, and drive. IoT-Wearables (Johnson et al., [2020](https://arxiv.org/html/2606.26432#bib.bib21)): $6{,}362$ stated-preference choices among three Internet-of-Things wearable devices varying in price, functional features, and a security/privacy labeling scheme (panel data, $728$ subjects). All splits are $70/15/15$. IoT-Wearables uses subject-level splitting (each subject’s rows go entirely to one split) to prevent within-subject leakage. Swissmetro and LPMC use the stratified row-level splits inherited from prior released parquets (Bierlaire et al., [2001](https://arxiv.org/html/2606.26432#bib.bib4) for Swissmetro and the LPMC public release); because Swissmetro is stated-preference with repeated choice tasks per respondent, row-level splitting may leak respondent-specific preferences across train/val/test, which we flag as a limitation.
#### Foundation-model inputs.
The foundation-model input columns per dataset:
- Swissmetro: per-alternative travel time, cost, headway, availability indicators; respondent age, income, season-ticket holding, luggage, trip purpose.
- LPMC: per-alternative duration, transit and driving costs, trip distance; respondent age, sex, license, car ownership.
- IoT-Wearables: per-alternative price, functional features, security label; respondent age, education, sex, security-behavior score, condition fixed effects.
Cost (or price), time (or duration), and availability indicators are therefore in the input set on every dataset; this is the basis for the foundation model’s potential non-monotonic response to cost, and motivates the fixed-$\mathbf{q}$ intervention protocol of Section [3.2](https://arxiv.org/html/2606.26432#S3.SS2) for the adapter.
#### Models.
Each cell evaluates five primary models, reported in Table [1](https://arxiv.org/html/2606.26432#S5.T1):
- Multinomial logit (Stage 1 of the adapter).
- Raw foundation model: Mitra (Maddix Robinson et al., [2025](https://arxiv.org/html/2606.26432#bib.bib13)) or TabPFN (Hollmann et al., [2023](https://arxiv.org/html/2606.26432#bib.bib11), [2025](https://arxiv.org/html/2606.26432#bib.bib12)).
- Architecturally constrained monotonic neural network (Sartor et al., [2025](https://arxiv.org/html/2606.26432#bib.bib19)).
- Feature-augmented multinomial logit, with $\mathbf{q}$ appended to the structural feature set.
- Simplified two-stage adapter $V_k=V^{\text{struct}}_k+g_k(\mathbf{q})$.
Three additional variants appear in prose only: a masked foundation model (Swissmetro, §[5.1](https://arxiv.org/html/2606.26432#S5.SS1)); a convex ensemble $\alpha P_{\text{MNL}}+(1-\alpha)P_{\text{FM}}$ (§[5.2](https://arxiv.org/html/2606.26432#S5.SS2)); and the two-term variant $V_k=V^{\text{struct}}_k+\alpha\log q_k+g_k(\mathbf{q})$ (ablation A2). The correction network $g$ is a two-hidden-layer MLP of width $32$ with a zero-initialized output layer (§[3.3](https://arxiv.org/html/2606.26432#S3.SS3)).
#### Foundation-model context and cross-fitted training $\mathbf{q}_i$.
Stage 2’s training-row $\mathbf{q}_i$ come from a $k=5$ stratified cross-fitted protocol, so no row’s prediction was made by a model that saw its own label; test $\mathbf{q}_i$ are out-of-context by construction. The protocol applies to five of six (dataset, FM) cells; TabPFN-LPMC exceeds the CUDA attention-kernel ceiling at LPMC’s per-fold context size and retains in-sample $\mathbf{q}_i$ (marked with † in Table [1](https://arxiv.org/html/2606.26432#S5.T1)). Cross-fitting shifts adapter test accuracy by at most $-0.6$ pp (Appendix [E](https://arxiv.org/html/2606.26432#A5)).
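The cross-fitting idea can be sketched as follows: each training row receives its $\mathbf{q}_i$ from a predictor whose context excluded that row’s fold. This is a minimal NumPy sketch under assumptions, not the authors’ pipeline. The `fit_predict(X_ctx, y_ctx, X_query)` callable is a hypothetical wrapper around whichever foundation model is used, and `frequency_predictor` is a trivial stand-in that only returns class frequencies so the example runs without any model dependency.

```python
import numpy as np

def stratified_folds(y, k=5, seed=0):
    """Assign each row to one of k folds, spreading every class evenly across folds."""
    rng = np.random.default_rng(seed)
    fold = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold[idx] = np.arange(len(idx)) % k
    return fold

def cross_fit_q(fit_predict, X, y, n_classes, k=5, seed=0):
    """Out-of-fold probabilities: row i is predicted from a context that excludes its fold."""
    fold = stratified_folds(y, k, seed)
    q = np.zeros((len(y), n_classes))
    for f in range(k):
        held = fold == f
        q[held] = fit_predict(X[~held], y[~held], X[held])
    return q

def frequency_predictor(X_ctx, y_ctx, X_query, n_classes=3):
    """Stand-in for a foundation model: predicts the context's class frequencies."""
    freq = np.bincount(y_ctx, minlength=n_classes) / len(y_ctx)
    return np.tile(freq, (len(X_query), 1))

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 3, size=60)
folds = stratified_folds(y, k=5)
q_train = cross_fit_q(frequency_predictor, X, y, n_classes=3, k=5)
```

Test-row $\mathbf{q}_i$ need no cross-fitting in this scheme, since the test rows never enter the context.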
#### Metrics and significance.
We report:
- Test-set accuracy.
- Per-row monotonicity rate (cluster-aware aggregation on IoT-Wearables).
- Trade-off ratio: value-of-time on transportation datasets; willingness-to-pay for indicated non-cost attributes on IoT-Wearables.
- Availability leak (Swissmetro only).
- Expected calibration error (ECE), post-calibration with $K=15$ equal-weight bins.
For each (dataset, foundation-model) cell we draw $10$ bootstrap samples of the training set with replacement and refit Stage 1 and Stage 2 independently on each replicate; validation and test splits are held fixed. Significance tests on the per-seed accuracy gain (full adapter minus Stage 1) include an exact two-sided binomial sign test ($p\approx 0.002$ at $10/10$ positive seeds) and per-seed McNemar paired-observation tests on the held-out test set. Three post-hoc calibration methods (scalar temperature scaling (Guo et al., [2017](https://arxiv.org/html/2606.26432#bib.bib16)), vector temperature scaling, and isotonic regression) are fit on validation and evaluated on test.
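The quoted sign-test value follows from simple arithmetic: under the null of no difference, each replicate is positive with probability one half, so ten positives out of ten has two-sided probability $2\cdot 2^{-10}\approx 0.00195$. The sketch below computes this with the standard library. The function name is hypothetical, and applying the same exact binomial tail to the discordant pairs is one common form of the McNemar test; the text does not say whether the exact or the chi-square form was used.

```python
from math import comb

def two_sided_binomial_p(successes, n):
    """Exact two-sided p-value for a fair-coin null, doubling the larger tail."""
    extreme = max(successes, n - successes)
    tail = sum(comb(n, j) for j in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Sign test: accuracy gain positive in 10 of 10 bootstrap replicates.
p_sign = two_sided_binomial_p(10, 10)

# Exact McNemar on discordant pairs (toy counts, hypothetical):
# b = rows only the adapter classifies correctly, c = rows only the MNL classifies correctly.
b, c = 40, 10
p_mcnemar = two_sided_binomial_p(b, b + c)
```

Here `p_sign` equals 2/1024, which rounds to the reported 0.002; with only ten replicates this is also the smallest p-value the sign test can produce.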
## 5 Main results
Table 1: Headline results across three datasets and two foundation models (Mitra, TabPFN). Adapter row reports the simplified two-stage adapter (ablation A2, paper’s primary). Trade-off units: Swissmetro CHF/hr value-of-time (VOT); LPMC GBP/hr VOT (public-transport / drive); IoT-Wearables USD willingness-to-pay (function / label feature). Means across $10$ bootstrap replicates (std omitted). † TabPFN-LPMC adapter uses in-sample train $\mathbf{q}$; other adapter cells use cross-fitted ($k=5$) train $\mathbf{q}$ (§[4](https://arxiv.org/html/2606.26432#S4)). Monotonic NN does not use a foundation model (so its Mitra and TabPFN columns are identical) and has no closed-form trade-off ratio. Bold marks behavioral-validity failures: monotonicity below $100\%$, trade-off ratio with wrong sign or implausibly large magnitude, or accuracy substantially below the multinomial-logit baseline. Swissmetro availability leak: $<10^{-9}$ for MNL/adapter; $5\times 10^{-4}$ for raw TabPFN; $2\times 10^{-3}$ for raw Mitra.

Table [1](https://arxiv.org/html/2606.26432#S5.T1) reports per-cell test accuracy, monotonicity rate, trade-off ratio, availability leak, and post-calibration expected calibration error (ECE) for the five models of §[4](https://arxiv.org/html/2606.26432#S4). The results break into three threads, mirroring the subsection structure below:
- §[5.1](https://arxiv.org/html/2606.26432#S5.SS1): the raw foundation models fail behavioral validity on three diagnostics (monotonicity, trade-off ratios, availability).
- §[5.2](https://arxiv.org/html/2606.26432#S5.SS2): the adapter recovers behavioral validity while keeping most of the accuracy gain. Calibration is competitive with one documented exception.
- §[5.3](https://arxiv.org/html/2606.26432#S5.SS3): under counterfactual cost perturbations, the raw foundation models violate aggregate monotonicity in $6$ of $16$ scenarios while the adapter never does.
### 5.1 The foundation models fail behavioral validity
Mitra and TabPFN gain $+4$ to $+14$ pp of accuracy over the multinomial logit across our three datasets, but the gains come with three distinct failures of behavioral validity. Monotonicity rates collapse on the datasets with the most predictive headroom: TabPFN is monotone in cost on only $28.9\%$ of LPMC test rows and $40.8\%$ of IoT-Wearables; Mitra performs better on LPMC ($50.8\%$) but only marginally on IoT-Wearables ($45.6\%$). On Swissmetro both foundation models score $\geq 90\%$ but still below the multinomial logit’s mathematical $100\%$. Trade-off ratios compound the issue: TabPFN’s LPMC driving value-of-time is $-14.5$ GBP/hr (wrong sign relative to MNL’s $+16.6$), and TabPFN’s IoT-Wearables function WTP is also of the wrong sign relative to MNL. Mitra’s willingness-to-pay on the IoT-Wearables binary indicators is locally inconclusive at the $1\%$ perturbation scale; we discuss the audit-methodology limitation in Section [7](https://arxiv.org/html/2606.26432#S7). Availability compliance fails on Swissmetro: TabPFN assigns $\sim 5\times 10^{-4}$ and Mitra $\sim 2\times 10^{-3}$ of total probability to formally unavailable alternatives, against the multinomial logit’s mechanically zero leak. Masking the foundation model (zeroing unavailable alternatives and renormalizing) removes the leak but leaves accuracy unchanged, so the leak is a structural error rather than an artifact of probability redistribution.
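The masking step described above (zero the unavailable alternatives, then renormalize) can be sketched in a few lines. The function name and toy probabilities are hypothetical; the leak in the first row is exaggerated for readability compared with the $10^{-4}$ to $10^{-3}$ magnitudes reported.

```python
import numpy as np

def mask_and_renormalize(P, avail):
    """Zero out probability on unavailable alternatives and rescale each row to sum to one."""
    masked = np.where(avail, P, 0.0)
    return masked / masked.sum(axis=1, keepdims=True)

# Toy foundation-model output (hypothetical): row 0 leaks mass onto an unavailable alternative.
P_fm = np.array([[0.5, 0.3, 0.2],
                 [0.6, 0.3, 0.1]])
avail = np.array([[True, True, False],
                  [True, True, True]])
P_masked = mask_and_renormalize(P_fm, avail)

same_top_choice = (P_fm.argmax(axis=1) == P_masked.argmax(axis=1)).all()
```

Renormalizing rescales the remaining entries by a common factor, so the predicted alternative changes only when the original top choice was itself unavailable, which is consistent with accuracy being unaffected when the leaked mass is small.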
### 5\.2The adapter recovers behavioral validity while keeping most of the accuracy gain
The adapter inherits the multinomial logit’s structural utility \(with a sign\-constrained cost coefficient\) and availability\-mask machinery, so it satisfies100%100\\%monotonicity, exact analytical trade\-off ratios, and zero availability leak by construction on every cell\. The empirical question is how much of the foundation\-model accuracy gain the adapter retains\.
The simplified adapter recovers most of it\. Across all six \(dataset, foundation\-model\) cells the adapter trails the raw foundation model by at most22pp on accuracy \(Table[1](https://arxiv.org/html/2606.26432#S5.T1)\): the adapter pays up to22pp for full preservation of the multinomial logit’s economic guarantees\. The gain over the structural multinomial logit alone is positive in1010of1010bootstrap replicates on every cell, giving an exact two\-sided binomial sign\-testpp\-value of≈0\.002\{\\approx\}\\,0\.002per cell, and per\-seed McNemar paired\-observation tests\(McNemar,[1947](https://arxiv.org/html/2606.26432#bib.bib24)\)\(which compare per\-example correctness between two classifiers on the same test set\) yieldingpp\-values below10−2010^\{\-20\}on the larger datasets \(LPMC, IoT\-Wearables\) and below10−1010^\{\-10\}on the smaller datasets\. The trade\-off ratios match the structural multinomial logit by construction \(Proposition[1](https://arxiv.org/html/2606.26432#Thmproposition1)\) and are stable across seeds:84\.4±2\.684\.4\\pm 2\.6CHF/hr on Swissmetro,1\.75±0\.051\.75\\pm 0\.05and16\.64±0\.3316\.64\\pm 0\.33GBP/hr on LPMC public transport and driving, and\+26\.10±0\.91\+26\.10\\pm 0\.91USD willingness\-to\-pay for IoT\-Wearables functional features\.
The feature-augmented multinomial logit reaches slightly higher accuracy than the adapter ($+0.5$ to $+2.6$ pp depending on the cell), but its trade-off-ratio estimates drift away from the structural multinomial logit’s. On Swissmetro the drift is egregious: the feature-augmented value of time is $1508 \pm 738$ CHF/hr for the Mitra-augmented variant and $678 \pm 319$ CHF/hr for the TabPFN-augmented variant, an order of magnitude beyond any plausible willingness-to-pay range. On LPMC and IoT-Wearables the drift is more modest but still nontrivial: the Mitra-augmented LPMC model reports 10.3 / 6.7 GBP/hr (public transport / driving) against MNL’s 1.8 / 16.6 (roughly $5\times$ drift on public transport, $0.4\times$ on driving); the feature-augmented IoT-Wearables models report 14–15 / 5 USD against MNL’s 26.1 / 10.0. The adapter, by contrast, inherits MNL’s trade-off ratios exactly under the fixed-$\mathbf{q}$ protocol (Proposition [1](https://arxiv.org/html/2606.26432#Thmproposition1)), so its drift is zero by construction. The convex-ensemble baseline, $\alpha\,P_{\text{MNL}} + (1-\alpha)\,P_{\text{FM}}$, fits an $\alpha$ near zero on every cell where the foundation model has higher validation accuracy, reducing to the raw foundation model up to numerical noise. Both baselines confirm that linear blending does not recover the adapter’s combination of accuracy and interpretability.
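
To make the convex-ensemble baseline concrete, the sketch below fits the mixing weight $\alpha$ by grid search on validation negative log-likelihood. The fitting criterion, the grid, and the synthetic probabilities are all assumptions made for illustration; the text only states that the fitted $\alpha$ ends up near zero when the foundation model is the stronger predictor.

```python
import numpy as np

def fit_alpha(P_mnl, P_fm, y, grid=np.linspace(0.0, 1.0, 101)):
    """Grid-search the mixing weight alpha that minimizes validation NLL
    of the blend alpha * P_mnl + (1 - alpha) * P_fm."""
    rows = np.arange(len(y))
    p_mnl = P_mnl[rows, y]                     # probability each model gives the chosen alternative
    p_fm = P_fm[rows, y]
    nll = [-np.log(np.clip(a * p_mnl + (1.0 - a) * p_fm, 1e-12, 1.0)).mean() for a in grid]
    return float(grid[int(np.argmin(nll))])

# Synthetic validation set: the stand-in "FM" is sharp and right, the stand-in "MNL" is uniform.
rng = np.random.default_rng(0)
n_val = 500
y_val = rng.integers(0, 3, size=n_val)
P_fm = np.full((n_val, 3), 0.05)
P_fm[np.arange(n_val), y_val] = 0.90
P_mnl = np.full((n_val, 3), 1.0 / 3.0)

alpha_hat = fit_alpha(P_mnl, P_fm, y_val)      # the blend collapses onto the stand-in FM
```

In this toy setting the blend puts all its weight on the stronger model, so it inherits that model's behavior wholesale instead of combining the two.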
#### Calibration\.
Post-temperature-scaling ECE is competitive with the structural multinomial logit on LPMC and IoT-Wearables (Table [1](https://arxiv.org/html/2606.26432#S5.T1)); Swissmetro is the exception, where adapter ECE stays at 8.4% (Mitra) and 17.0% (TabPFN) against the multinomial logit’s 4.0% even after scalar / vector temperature scaling, isotonic regression, and bootstrap ensembling, a bias-driven rather than variance-driven limitation we discuss in Section [7](https://arxiv.org/html/2606.26432#S7). Full NLL, Brier, and uncalibrated ECE in Appendix [D](https://arxiv.org/html/2606.26432#A4) mirror this pattern.
### 5.3 Counterfactual aggregate-share evaluation
To test how each model would predict the aggregate demand response to a small price increase, we evaluate every model on the held-out test set under a $+10\%$ cost perturbation applied separately to each alternative, then aggregate predicted probabilities into a market share for the perturbed alternative. Table [2](https://arxiv.org/html/2606.26432#S5.T2) reports the full result matrix.
Table 2: Counterfactual aggregate-share change (pp) under $+10\%$ cost perturbation. Adapter: mean across 10 replicates (std $< 0.15$ pp). Bold marks monotonicity violations ($\Delta$ share $> 0$ under a cost *increase*).
Across 16 (dataset, foundation-model, alternative) scenarios the structural multinomial logit and the adapter agree on the direction of the share change in every scenario: both predict a strictly negative change in aggregate share for the perturbed alternative. Adapter share-changes recover 40–97% of the multinomial logit’s magnitude depending on cell (Table [2](https://arxiv.org/html/2606.26432#S5.T2)); the gap reflects softmax saturation under the foundation-model correction, not a violation of Proposition [1](https://arxiv.org/html/2606.26432#Thmproposition1).
The raw foundation-model row uses the recomputed-$\mathbf{q}$ protocol (re-running the foundation model on perturbed inputs) and shows the per-observation monotonicity violations of Section [5.1](https://arxiv.org/html/2606.26432#S5.SS1) propagating to policy-relevant aggregates: in 6 of 16 scenarios the raw foundation model predicts a positive aggregate-share change under a cost *increase* (bolded in Table [2](https://arxiv.org/html/2606.26432#S5.T2)). The largest violation (TabPFN-IoT-alt1) reverses sign relative to both MNL and the adapter, an 11-pp disagreement on consumer response to a 10% price increase. The structural multinomial logit and the adapter are exempt by construction.
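
The counterfactual evaluation of this section can be sketched for an adapter-style utility as follows. Everything here is illustrative: the coefficients, the synthetic attributes, the availability pattern, and the array `g` (a stand-in for the frozen correction $g_k(\mathbf{q}_i)$) are assumptions, not the paper's fitted values.

```python
import numpy as np

def masked_softmax(V, avail):
    """Softmax over available alternatives; unavailable ones get exactly zero probability."""
    V = np.where(avail, V, -np.inf)
    V = V - V.max(axis=1, keepdims=True)
    expV = np.exp(V)
    return expV / expV.sum(axis=1, keepdims=True)

def share_change(cost, time, avail, beta_cost, beta_time, g, alt, bump=0.10):
    """Change (in pp) of alternative `alt`'s aggregate share when its cost rises by `bump`.
    `g` holds g_k(q_i) and is NOT recomputed under the perturbation (fixed-q protocol)."""
    def shares(c):
        V = beta_cost * c + beta_time * time + g
        return masked_softmax(V, avail).mean(axis=0)
    cost_cf = cost.copy()
    cost_cf[:, alt] *= 1.0 + bump
    return 100.0 * (shares(cost_cf)[alt] - shares(cost)[alt])

rng = np.random.default_rng(0)
N, K = 1000, 3
cost = rng.uniform(1.0, 10.0, size=(N, K))
time = rng.uniform(10.0, 60.0, size=(N, K))
avail = rng.random((N, K)) < 0.9
avail[:, 0] = True                      # every row keeps at least one available alternative
g = rng.normal(size=(N, K))             # hypothetical frozen correction values

deltas = [share_change(cost, time, avail, -0.3, -0.05, g, alt=k) for k in range(K)]
```

Because the cost coefficient is negative and `g` is held fixed, every entry of `deltas` is negative regardless of what `g` contains. That is the sense in which the structural model and the adapter are exempt from aggregate monotonicity violations.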
## 6 Ablations
We run five ablations using the same multi\-seed bootstrap protocol as Section[5](https://arxiv.org/html/2606.26432#S5)\.
Table 3: Ablation summary: test accuracy (%), mean $\pm$ std across 10 bootstrap replicates. A4 (graceful degradation under foundation-model context restriction) has a different shape (one row per context fraction) and is plotted separately in Figure [1](https://arxiv.org/html/2606.26432#S6.F1).
A1: log-only. $V_k = V^{\text{struct}}_k + \alpha \log q_k$ replaces the neural correction with the scalar log-probability term alone. Competitive on the transportation datasets but trails by 1–3 pp of accuracy gain on the new datasets, where the foundation model’s predictions encode patterns a single scalar projection cannot capture.
A2: simplified architecture (paper’s primary). $V_k = V^{\text{struct}}_k + g_k(\mathbf{q})$ drops the $\alpha \log q_k$ term. The numbers are nearly identical to the two-term variant’s across all six cells; the simplification comes at no empirical cost.
A3: joint training. Training $\boldsymbol{\beta}$ and $g$ jointly under the same likelihood gives accuracy comparable to or slightly higher than the two-stage adapter, but the structural cost coefficient collapses on the transportation datasets by a factor of 3–17$\times$ as $g(\mathbf{q})$ absorbs the cost-induced variation. The collapse does not reproduce on IoT-Wearables, where the joint $\beta_{\text{cost}}$ is $\sim 25\%$ larger; we attribute this to whether cost is smoothly recoverable from $\mathbf{q}$ (Appendix [B.1](https://arxiv.org/html/2606.26432#A2.SS1)).
A4: degraded foundation model. We retrained Mitra and TabPFN on 50%, 25%, and 10% of the Swissmetro train+val context, using the same $k=5$ cross-fitted training-$\mathbf{q}_i$ protocol as the headline cells. Accuracy gain decreases roughly linearly with context fraction, with 10/10 bootstrap replicates positive at every level (Figure [1](https://arxiv.org/html/2606.26432#S6.F1)). The adapter falls back toward Stage 1 but does not collapse below it.
A5: capacity sweep. Varying the correction network’s hidden width (16, 32, 64) and depth (1, 2 layers) keeps accuracy within 1 pp of A2’s across all cells. The gain comes from the composition of structural utility plus foundation-model correction, not from correction-network capacity.
Table [3](https://arxiv.org/html/2606.26432#S6.T3) summarizes test accuracy across all five variants. Four of the five preserve 100% monotonicity by construction; A3 is the empirical illustration of Proposition [2](https://arxiv.org/html/2606.26432#Thmproposition2).
Figure 1: Graceful degradation (A4) on Swissmetro: adapter accuracy gain over Stage 1 (pp) as the foundation-model context fraction is reduced. One panel per foundation model; the dotted line marks the abstract’s $\geq 6$ pp claim. Markers are means across 10 bootstrap replicates with 95% CI (cross-fitted protocol).
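
The $k=5$ cross-fitted protocol used for the training-$\mathbf{q}_i$ values, together with the context restriction of ablation A4, might look roughly like the sketch below. The function `centroid_fm` is a hypothetical stand-in for an in-context foundation model (it is not Mitra or TabPFN), and subsampling the context inside each fold is an assumption about how the restriction is applied.

```python
import numpy as np

def centroid_fm(X_ctx, y_ctx, X_new, n_classes):
    """Hypothetical stand-in for an in-context foundation model:
    softmax of negative squared distance to each class centroid."""
    cents = np.stack([X_ctx[y_ctx == c].mean(axis=0) for c in range(n_classes)])
    d = ((X_new[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
    z = -d - (-d).max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def cross_fit_q(X, y, n_classes, k=5, context_frac=1.0, seed=0):
    """Out-of-fold q_i: each row is scored by a model whose context excludes that row."""
    rng = np.random.default_rng(seed)
    n = len(y)
    q = np.empty((n, n_classes))
    for held_out in np.array_split(rng.permutation(n), k):
        ctx = np.setdiff1d(np.arange(n), held_out)
        keep = max(n_classes, int(context_frac * len(ctx)))
        ctx = rng.choice(ctx, size=keep, replace=False)   # restrict the context (as in A4)
        q[held_out] = centroid_fm(X[ctx], y[ctx], X[held_out], n_classes)
    return q

rng = np.random.default_rng(1)
y = rng.integers(0, 3, size=600)
X = rng.normal(size=(600, 4)) + y[:, None]                # class-shifted synthetic features
q_full = cross_fit_q(X, y, n_classes=3)
q_half = cross_fit_q(X, y, n_classes=3, context_frac=0.5)
```

Cross-fitting matters because Stage 2 is trained on these vectors: if a row's own label were in the context that produced its $\mathbf{q}_i$, the correction network would learn from predictions that look more reliable than they will at test time.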
## 7 Discussion
Aggregate monotonicity violations on LPMC and IoT-Wearables. The per-row failures of Section [5.1](https://arxiv.org/html/2606.26432#S5.SS1) propagate to aggregate share predictions: raw foundation models predict that a $+10\%$ cost increase *raises* aggregate share in 6 of 16 scenarios. The structural multinomial logit and adapter are exempt by Proposition [1](https://arxiv.org/html/2606.26432#Thmproposition1).
Calibration on Swissmetro. The adapter’s post-calibration ECE stays elevated on Swissmetro (8.4% for Mitra, 17.0% for TabPFN) against the multinomial logit’s 4.0%. Scalar / vector temperature scaling, isotonic regression, and 10-seed bootstrap ensembling all fail to recover multinomial-logit-comparable calibration. The residual error is bias-driven (per-seed adapter distributions miscalibrated in the same direction), so ensemble averaging does not help. Calibration on the other two datasets is competitive with the multinomial logit; the Swissmetro limitation is dataset-specific.
Audit methodology on discrete attributes. The audit’s 1%-perturbation recipe is sound for continuous attributes but produces deltas at float32 rounding precision on IoT-Wearables’ binary indicators (function and label) in Mitra cells. We use a discrete-flip protocol instead (set the indicator to its complement, sign-correct the per-row delta), giving finite trade-off values: function flip-WTPs span $-0.05$ to $+0.59$ USD; label flips span $-0.13$ to $+0.23$ USD, with both signs appearing within a single attribute, indicating that Mitra’s per-attribute response is weak relative to its per-cost response.
Proposition [2](https://arxiv.org/html/2606.26432#Thmproposition2) is dataset-dependent. The joint-training cost-coefficient collapse (ablation A3, §[6](https://arxiv.org/html/2606.26432#S6)) reproduces on Swissmetro and LPMC and does not on IoT-Wearables. The proposition is unaffected: identifiability fails when cost is recoverable from $\mathbf{q}$ as a continuous map, and the empirical strength of the failure tracks how well that condition is met. On IoT-Wearables the foundation-model probability vector does not closely track per-product price.
Adapter inherits Stage 1’s MNL specification. The adapter’s trade-off ratios are inherited from $\boldsymbol{\beta}^{\ast}$ by construction (Proposition [1](https://arxiv.org/html/2606.26432#Thmproposition1)). If the Stage 1 multinomial logit is misspecified for the deployment population, $g$ cannot correct the implied trade-off ratio because $g$’s contribution to the structural derivative is zero under the fixed-$\mathbf{q}$ protocol. Practitioners deploying the adapter to new domains should treat the analytical trade-off ratio as conditional on the Stage 1 specification.
## 8 Conclusion
Discrete-choice models guide policy decisions whose economic stakes are large and whose failure modes are not subtle. Tabular foundation models raise predictive accuracy beyond what structural utility models alone reach, but their predictions can violate basic economic logic: monotonicity in cost, sensible willingness-to-pay, zero probability on unavailable alternatives. We propose a two-stage adapter that embeds foundation-model predictions inside a structurally constrained utility model. Proposition [1](https://arxiv.org/html/2606.26432#Thmproposition1) shows that the two-stage procedure preserves the structural model’s marginal rate of substitution exactly; Proposition [2](https://arxiv.org/html/2606.26432#Thmproposition2) shows that joint training does not. Across three discrete-choice datasets and two foundation models, the adapter pays at most 2 pp of accuracy for full structural validity by construction, with the gain over the structural multinomial logit positive in 10 of 10 bootstrap replicates on every cell. The architecture is model-agnostic and accepts any predict function whose probability output respects the precomputed-$\mathbf{q}$ contract.
## References
- M. Ben-Akiva and S. R. Lerman (1985) Discrete choice analysis: theory and application to travel demand. MIT Press.
- M. Bierlaire, K. W. Axhausen, and G. Abay (2001) The acceptance of modal innovation: the case of Swissmetro. In Proceedings of the 1st Swiss Transport Research Conference.
- G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2(4), pp. 303–314.
- C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning.
- Y. Han, F. Camara Pereira, M. Ben-Akiva, and C. Zegras (2022) A neural-embedded discrete choice model: learning taste representation with strengthened interpretability. Transportation Research Part B 163, pp. 166–186.
- K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision.
- T. Hillel, M. Bierlaire, M. Z. E. B. Elshafie, and Y. Jin (2021) A systematic review of machine learning classification methodologies for modelling passenger mode choice. Journal of Choice Modelling 38, pp. 100221.
- T. Hillel, M. Z. E. B. Elshafie, and Y. Jin (2018) Recreating passenger mode choice-sets for transport simulation: a case study of London, UK. Proceedings of the Institution of Civil Engineers, Smart Infrastructure and Construction 171(1), pp. 29–42.
- G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- N. Hollmann, S. Müller, K. Eggensperger, and F. Hutter (2023) TabPFN: a transformer that solves small tabular classification problems in a second. In International Conference on Learning Representations.
- N. Hollmann, S. Müller, L. Purucker, A. Krishnakumar, M. Körfer, S. B. Hoo, R. T. Schirrmeister, and F. Hutter (2025) Accurate predictions on small data with a tabular foundation model. Nature.
- S. D. Johnson, J. M. Blythe, M. Manning, and G. T. W. Wong (2020) The impact of IoT security labelling on consumer product choice and willingness to pay. PLOS ONE 15(1), pp. e0227800. [doi:10.1371/journal.pone.0227800](https://dx.doi.org/10.1371/journal.pone.0227800)
- D. Maddix Robinson, J. Yin, N. Erickson, A. F. Ansari, B. Han, S. Zhang, L. Akoglu, C. Faloutsos, M. W. Mahoney, A. G. Wilson, H. Wang, Y. Wang, B. Wang, and X. Zhang (2025) Mitra: mixed synthetic priors for enhancing tabular foundation models. arXiv preprint arXiv:2510.21204.
- Q. McNemar (1947) Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12(2), pp. 153–157.
- D. Sartor, A. Sinigaglia, and G. A. Susto (2025) Advancing constrained monotonic neural networks: achieving universal approximation beyond bounded activations. In International Conference on Machine Learning. PMLR 267, arXiv:2505.02537.
- J. Sill (1997) Monotonic networks. In Advances in Neural Information Processing Systems.
- K. E. Train (2009) Discrete choice methods with simulation. 2nd edition, Cambridge University Press.
- S. van Cranenburgh, S. Wang, A. Vij, F. Pereira, and J. Walker (2022) Choice modelling in the age of machine learning: discussion paper. Journal of Choice Modelling 42, pp. 100340.
- Y. Wang, X. Sun, Y. Li, Z. Fan, and Z. Zhuang (2026) Auditing and fixing economic validity in tabular foundation models for discrete choice. In ICML 2026 Workshop on Foundation Models for Structured Data (FMSD). [arXiv:2605.26559](https://arxiv.org/abs/2605.26559)
- A. Wehenkel and G. Louppe (2019) Unconstrained monotonic neural networks. In Advances in Neural Information Processing Systems.
- X. Zhao, X. Yan, A. Yu, and P. Van Hentenryck (2020) Prediction and behavioral analysis of travel mode choice: a comparison of machine learning and logit models. Travel Behaviour and Society 20, pp. 22–35.
## Appendix A Full proofs of the propositions
This appendix gives the full proofs of both propositions; the main text gave shorter sketches in Section [3](https://arxiv.org/html/2606.26432#S3). Notation is the same as in the main text: $x_i$ is observation $i$’s feature vector, $K$ is the number of alternatives, $\phi_k$ pulls out the attributes that enter alternative $k$’s structural utility, $\beta$ holds the structural coefficients, $\mathbf{q}$ is the foundation model’s predicted probability vector, and $g$ is the correction network. Choice probabilities are the usual softmax $P_k(x) = \exp(V_k(x)) / \sum_j \exp(V_j(x))$.
### A.1 Proof of Proposition [1](https://arxiv.org/html/2606.26432#Thmproposition1)
###### Proposition 1 (restated).
Under the fixed-$\mathbf{q}$ training protocol of Section [3](https://arxiv.org/html/2606.26432#S3), where $\mathbf{q}(x_i)$ is computed once on the unperturbed input and held fixed across optimization and counterfactual evaluation, let $\beta^{\ast}$ be the Stage 1 maximum-likelihood estimate of the structural coefficients and let $g$ be any Stage 2 parameters satisfying the precomputed-$\mathbf{q}$ contract. For any two attributes $j, j'$ that enter the model only through $V^{\text{struct}}_k(x_i) = \beta^{\top}\phi_k(x_i)$ in the identity-on-the-attribute form, and for any observation $x_i$ at which $\phi_k$ is differentiable in $x_{ij}$ and $x_{ij'}$,
$$\mathrm{MRS}_{j,j'}(x_i) \;\equiv\; \frac{\partial V_k(x_i)/\partial x_{ij}}{\partial V_k(x_i)/\partial x_{ij'}} \;=\; \frac{\beta^{\ast}_j}{\beta^{\ast}_{j'}}.$$
###### Proof\.
The utility is $V_k(x_i) = \beta^{\ast\top}\phi_k(x_i) + g_k(\mathbf{q}(x_i))$. Differentiating in $x_{ij}$:
$$\frac{\partial V_k(x_i)}{\partial x_{ij}} = \beta^{\ast\top}\frac{\partial\phi_k(x_i)}{\partial x_{ij}} + \nabla_{\mathbf{q}} g_k\!\left(\mathbf{q}(x_i)\right)^{\!\top}\frac{\partial\mathbf{q}(x_i)}{\partial x_{ij}}. \tag{2}$$
The fixed-$\mathbf{q}$ protocol stores $\mathbf{q}_i := \mathbf{q}(x_i)$ once, computed on the unperturbed input, and never re-differentiates through it. So $\partial\mathbf{q}_i/\partial x_{ij} = 0$ by definition, and the second term in ([2](https://arxiv.org/html/2606.26432#A1.E2)) vanishes regardless of what $\mathbf{q}_i$ or $g$ happen to be:
$$\frac{\partial V_k(x_i)}{\partial x_{ij}} = \beta^{\ast\top}\frac{\partial\phi_k(x_i)}{\partial x_{ij}}. \tag{3}$$
For attributes that enter $\phi_k$ in identity-on-the-attribute form, $\partial\phi_k(x_i)/\partial x_{ij} = e_j \cdot \mathbb{1}[j \in S_k]$, where $S_k$ is the set of indices $\phi_k$ depends on. Substituting into ([3](https://arxiv.org/html/2606.26432#A1.E3)) gives $\partial V_k/\partial x_{ij} = \beta^{\ast}_j$ when $j \in S_k$ and zero otherwise. Since $j, j'$ both enter through $\phi_k$, the ratio collapses to $\beta^{\ast}_j/\beta^{\ast}_{j'}$, independent of $x_i$ and $g$. ∎
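
The conclusion of the proof can be checked numerically with finite differences. In this minimal sketch the coefficient values are illustrative (not the paper's estimates), and `g_of_q` stands for the stored constant $g_k(\mathbf{q}_i)$, which by the fixed-$\mathbf{q}$ protocol does not change when an attribute is perturbed.

```python
import numpy as np

beta = {"cost": -0.08, "time": -0.02}        # illustrative beta*, not fitted values

def utility(cost, time, g_of_q):
    """V_k = beta_cost * cost + beta_time * time + g_k(q_i); g_k(q_i) is a stored constant."""
    return beta["cost"] * cost + beta["time"] * time + g_of_q

def mrs_time_cost(cost, time, g_of_q, h=1e-4):
    """Central finite-difference ratio (dV/dtime) / (dV/dcost) with q_i held fixed."""
    dV_dtime = (utility(cost, time + h, g_of_q) - utility(cost, time - h, g_of_q)) / (2 * h)
    dV_dcost = (utility(cost + h, time, g_of_q) - utility(cost - h, time, g_of_q)) / (2 * h)
    return dV_dtime / dV_dcost

rng = np.random.default_rng(0)
corrections = 3.0 * rng.normal(size=5)       # arbitrary values of g_k(q_i)
ratios = [mrs_time_cost(rng.uniform(1, 50), rng.uniform(5, 120), g) for g in corrections]
```

Every entry of `ratios` matches `beta["time"] / beta["cost"]` up to finite-difference rounding, whatever the attribute values or the correction. If `g_of_q` were instead recomputed from the perturbed inputs, the extra chain-rule term in (2) would reappear and the ratio would no longer be pinned down.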
#### Remark on the protocol\.
The proof leans on $\partial\mathbf{q}/\partial x = 0$, which is a property of the protocol rather than the architecture. Under the recomputed-$\mathbf{q}$ protocol, $g$ would contribute a chain-rule term $\nabla_{\mathbf{q}} g_k^{\top} \cdot \partial\mathbf{q}/\partial x_{ij}$ that has no sign or magnitude guarantee, and the trade-off ratio is no longer tied to $\beta^{\ast}_j/\beta^{\ast}_{j'}$. We use fixed-$\mathbf{q}$ because it matches the audit’s intent: the analyst wants the structural part of the utility to respond to a price change while the foundation model’s per-chooser assessment stays put, rather than re-running the foundation model on a counterfactual feature value it never saw at pretraining time.
### A.2 Proof of Proposition [2](https://arxiv.org/html/2606.26432#Thmproposition2)
###### Proposition 2 (restated).
Suppose:
1. (i) the foundation model $\mathbf{q}: \mathcal{X} \to \Delta^{K-1}$ is continuously differentiable on the support of the training distribution, with non-vanishing partial derivative with respect to a designated cost feature on a set of positive measure;
2. (ii) the correction $g$ is drawn from a function class $\mathcal{G} \subseteq C^{0}(\Delta^{K-1}, \mathbb{R}^{K})$ that is dense in the continuous functions on the image of $\mathbf{q}$ in the supremum norm;
3. (iii) the joint negative log-likelihood $L(\beta, g) = -\frac{1}{N}\sum_i \log P_{y_i}(x_i; \beta, g)$ is minimized over $(\beta, g)$ jointly, with no two-stage constraint and no regularization on $\|\beta\|$ or $\|g\|$.
Then for any structural parameter vector $\beta^{(0)}$ achieving joint loss $L^{\ast}$, there exists a one-parameter family $\{(\beta^{(c)}, g^{(c)}) : c \in \mathbb{R}\}$ with $\beta^{(c)}$ distinct in their cost coordinate, all achieving joint loss $L^{\ast}$ in the limit as the approximation $g^{(c)} \in \mathcal{G}$ is refined.
###### Proof\.
The strategy: for any $c$, build a correction $g^{(c)}$ that exactly cancels the change made to the structural cost coefficient, leaving predicted probabilities, and hence the loss, untouched.
Pick a starting minimizer $(\beta^{(0)}, g^{(0)})$. Define a shifted $\beta^{(c)}$ by $\beta^{(c)}_{\text{cost}} = \beta^{(0)}_{\text{cost}} - c$ and $\beta^{(c)}_j = \beta^{(0)}_j$ for every other coordinate. We build a matching $g^{(c)}$ so that choice probabilities are unchanged for every $x$ and every $k$: $P_k(x; \beta^{(c)}, g^{(c)}) = P_k(x; \beta^{(0)}, g^{(0)})$. Once that holds, the joint loss is unchanged.
The softmax is invariant to adding the same constant to every $V_k$, so it’s enough to show that $V_k(x; \beta^{(c)}, g^{(c)}) - V_k(x; \beta^{(0)}, g^{(0)})$ is the same across $k$ for every $x$.
Plugging in the shift, the structural part of that difference, with its sign flipped, is
$$V^{\text{struct}}_k(x; \beta^{(0)}) - V^{\text{struct}}_k(x; \beta^{(c)}) = (\beta^{(0)}_{\text{cost}} - \beta^{(c)}_{\text{cost}}) \cdot \mathrm{cost}_k(x) = c \cdot \mathrm{cost}_k(x).$$
So if we add $c \cdot \mathrm{cost}_k(x)$ to $g^{(0)}_k(\mathbf{q}(x))$, we exactly reproduce the original $V_k$. The catch: $g$ only sees $\mathbf{q}$, not $x$. So we need $\mathrm{cost}_k(x)$ to be recoverable as a continuous function of $\mathbf{q}(x)$.
Assumption (i) guarantees this locally. Where $\mathbf{q}$ is continuously differentiable and its partial in cost is non-zero, the implicit function theorem gives a continuous local inverse: a function $\kappa_k$ on a neighborhood with $\mathrm{cost}_k(x) = \kappa_k(\mathbf{q}(x))$. Local inverses agree where neighborhoods overlap (because $\mathrm{cost}_k$ is single-valued), so they paste into a continuous global map $\kappa_k: \mathrm{Im}(\mathbf{q}) \to \mathbb{R}$ on (at least) the positive-measure subset where the assumption holds; continuous extension to the closure preserves continuity on the compact image.
Now let $\tilde{h}_k(\mathbf{q}) = c \cdot \kappa_k(\mathbf{q})$; this is continuous on $\mathrm{Im}(\mathbf{q})$. Assumption (ii) says $\mathcal{G}$ is dense in continuous functions on this image (true for sufficiently wide MLPs by universal approximation [Cybenko, [1989](https://arxiv.org/html/2606.26432#bib.bib27)]), so for any $\varepsilon > 0$ we can pick a $g^{(c,\varepsilon)} \in \mathcal{G}$ whose components are within $\varepsilon$ of $g^{(0)}_k + \tilde{h}_k$ in the supremum norm.
The log-softmax is Lipschitz in $V$ under the sup norm, so the per-row log-likelihood error is bounded by a constant times $\varepsilon$ uniformly across $i$. Hence $L(\beta^{(c)}, g^{(c,\varepsilon)}) \to L^{\ast}$ as $\varepsilon \to 0$, for every $c$. Gradient descent on the joint loss can therefore land at any value of $\beta^{(c)}_{\text{cost}}$ depending on initialization: the structural cost coefficient is not identifiable from the joint loss alone. ∎
#### Remark on assumption \(i\)\.
If $\mathbf{q}$ doesn’t react to cost on some subset (its cost partial vanishes there), then cost isn’t recoverable from $\mathbf{q}$ on that subset, and the implicit-function step doesn’t extend. You then get a weaker statement: the joint loss is flat in $\beta_{\text{cost}}$ only along the parts of the support where cost is recoverable, and the empirical collapse is stronger on datasets where the foundation model has internalized cost more thoroughly. This is the partial-collapse regime discussed in Section [B.1](https://arxiv.org/html/2606.26432#A2.SS1).
#### Remark on regularization\.
A small $L_2$ penalty doesn’t fix the problem. Along the family $(\beta^{(c)}, g^{(c)})$, the data loss is exactly flat in $c$ but the penalty term $\lambda\|\beta\|^2 + \mu\|g\|^2$ varies. So the regularized minimum is decided by the penalty’s preferred point along the family, which depends on how $g$ is parameterized and where it was initialized. Cross-validation doesn’t help either: validation likelihood is also flat in $c$. A large penalty does restore identifiability, but only by driving everything toward zero, so fit suffers. The two-stage procedure sidesteps the trade-off entirely: fixing $g \equiv 0$ during Stage 1 is a structural constraint, not a penalty, and Stage 1 recovers the standard MNL MLE. See Section [B.2](https://arxiv.org/html/2606.26432#A2.SS2) for the longer version.
## Appendix B Proposition [2](https://arxiv.org/html/2606.26432#Thmproposition2): extended discussion
### B.1 Cost-recoverability and partial collapse
Proposition [2](https://arxiv.org/html/2606.26432#Thmproposition2)’s implicit-function step assumes the foundation model is differentiable in cost on a positive-measure subset of the input space. The proposition’s conclusion is then *global* non-identifiability: an entire family of $(\beta, g)$ pairs achieves the same loss. In practice, foundation models aren’t uniformly cost-sensitive everywhere: on some inputs $\mathbf{q}$ barely reacts to cost. Where it doesn’t, the implicit function step doesn’t extend.
The weaker statement is what we actually see empirically: the joint loss is flat in $\beta_{\text{cost}}$ only on the parts of the support where cost is recoverable from $\mathbf{q}$. On Swissmetro and LPMC, where the foundation model has clearly picked up cost-correlated structure, joint training collapses the structural cost coefficient by 3 to 17$\times$ relative to the two-stage estimate (Section [6](https://arxiv.org/html/2606.26432#S6)). On IoT-Wearables, where per-product prices aren’t closely tracked by $\mathbf{q}$, the joint estimate doesn’t collapse and is in fact slightly larger in magnitude than the two-stage one. The empirical strength of the failure scales with how thoroughly the foundation model has internalized cost, just as the weaker statement predicts.
It’s natural to ask whether the collapse is specific to a particular foundation model. Both Mitra and TabPFN show it on Swissmetro and LPMC, with magnitudes within $2\times$ of each other, so the driver seems to be the in-context-learning paradigm itself rather than architectural specifics.
### B.2 Regularization
Proposition [2](https://arxiv.org/html/2606.26432#Thmproposition2) is stated for unregularized joint minimization. Real training pipelines usually add a small $L_2$ penalty, so does that fix things? Short answer: not unless the penalty is so large that it kills the fit.
Small $L_2$ penalty. Along the family $(\beta^{(c)}, g^{(c)})$ from the proof, the data-fit loss is exactly flat in $c$. The penalty $\lambda\|\boldsymbol{\beta}\|^2 + \mu\|g\|^2$ varies along the family, but its preferred $c$ depends on how $g$’s parameter space is shaped in $L_2$, which is itself initialization-dependent. Cross-validation doesn’t help either: the validation likelihood is also flat in $c$. So small $L_2$ doesn’t restore identifiability; it just picks a regularizer-preferred point.
Large $L_2$ penalty. Now the minimizer is dominated by the regularizer: $\boldsymbol{\beta}$ shrinks toward zero, $g$ stays small, and identifiability is technically restored. But the resulting $\boldsymbol{\beta}$ isn’t the MLE either; it’s just a shrunk version. Strong enough regularization to fix identifiability also ruins the fit.
The two-stage procedure avoids this trade-off. Setting $g \equiv 0$ during Stage 1 is a hard structural constraint, not a soft penalty. The Stage 1 problem is just the standard MNL likelihood, which has a unique MLE under the usual regularity conditions. Stage 2 then fits $g$ on top without disturbing $\boldsymbol{\beta}^{*}$.
## Appendix C Audit methodology: discrete-attribute trade-off ratios
The behavioral audit perturbs each attribute by $1\%$ of its observed range and computes a finite-difference derivative. That works fine for continuous attributes (Swissmetro travel time, LPMC duration, cost). On IoT-Wearables, the functional-feature and labeling attributes are binary $\{0,1\}$ indicators; a $1\%$ perturbation off the grid is too small for float32 to register, and in the Mitra cells the deltas round to zero.
We swapped in a discrete-flip protocol for these: set the indicator to its complement and sign-correct the per-row delta by $1 - 2X_{i,\text{attr}}$ so that $0 \to 1$ and $1 \to 0$ flips share the same denominator sign. After the swap, functional-feature flip-WTPs span $-0.05$ to $+0.59$ USD and label flip-WTPs span $-0.13$ to $+0.23$ USD, with both signs appearing within a single attribute. The small magnitude tells us Mitra's per-attribute response is weak relative to its per-cost response, not that the audit was broken.
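A minimal sketch of the flip-and-sign-correct step follows. The `predict` interface (a callable returning one probability per row) and the toy logistic model are assumptions for illustration; the step from this response to a WTP ratio is not shown.

```python
import numpy as np


def flip_response(predict, X, attr):
    """Per-row response to flipping a binary column, expressed as a 0 -> 1 effect.

    predict: hypothetical callable mapping an (n_rows, n_features) array
             to an (n_rows,) array of choice probabilities.
    """
    X = np.asarray(X, dtype=np.float64)
    X_flip = X.copy()
    X_flip[:, attr] = 1.0 - X[:, attr]      # set the indicator to its complement
    raw_delta = predict(X_flip) - predict(X)
    step = 1.0 - 2.0 * X[:, attr]           # +1 for a 0->1 flip, -1 for a 1->0 flip
    # Dividing by a step of +/-1 is the same as multiplying by it.
    return raw_delta * step, raw_delta


def toy_predict(X):
    """Stand-in model: logistic response with a positive weight on column 0."""
    weights = np.array([0.8, -0.5])
    return 1.0 / (1.0 + np.exp(-(X @ weights)))


rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 2)).astype(np.float64)
corrected, raw = flip_response(toy_predict, X_demo, attr=0)
print("raw deltas have mixed signs:", bool((raw > 0).any() and (raw < 0).any()))
print("corrected deltas all positive:", bool((corrected > 0).all()))
```

Without the correction, rows that start at 1 report the negative of the attribute's effect, so averaging raw deltas would partially cancel a genuine response.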
## Appendix D Probabilistic-quality metrics: NLL, Brier, uncalibrated ECE
The headline table reports post-temperature-scaling ECE only. Table [4](https://arxiv.org/html/2606.26432#A4.T4) reports the broader picture: test-set negative log-likelihood (NLL), Brier score (sum-of-squares against the one-hot target, averaged across rows), and *uncalibrated* ECE (maximum-confidence binning, $K=15$ equal-width bins, no post-hoc temperature scaling) for the three primary models. MNL and adapter cells are mean $\pm$ std across $10$ bootstrap replicates; the raw FM is deterministic.
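The three metrics as defined above can be sketched in NumPy as follows. The function names, the probability clipping constant, and the right-inclusive bin edges are assumptions of this sketch rather than details confirmed by the paper; the ECE is returned as a fraction, so multiply by 100 to compare with percentages.

```python
import numpy as np


def nll(probs, y, eps=1e-12):
    """Mean negative log-likelihood of the true class (eps clipping is an assumption)."""
    p_true = probs[np.arange(len(y)), y]
    return float(-np.mean(np.log(np.clip(p_true, eps, 1.0))))


def brier(probs, y):
    """Sum-of-squares Brier score against the one-hot target, averaged across rows."""
    onehot = np.eye(probs.shape[1])[y]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))


def ece_max_conf(probs, y, n_bins=15):
    """Expected Calibration Error with maximum-confidence, equal-width binning."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == y).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0..n_bins-1; bins are right-inclusive.
    idx = np.digitize(conf, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # Bin weight times |accuracy - mean confidence| within the bin.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)


probs_demo = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4]])
y_demo = np.array([0, 1, 0])
print(nll(probs_demo, y_demo), brier(probs_demo, y_demo), ece_max_conf(probs_demo, y_demo))
```

Because the Brier score here sums over classes before averaging, its range is $[0, 2]$ rather than $[0, 1]$, which matters when comparing against scores reported under the per-class-mean convention.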
Three patterns stand out:
- Raw foundation models win NLL and Brier on every cell. They put more probability mass on the correct label than either MNL or the adapter, consistent with their accuracy advantage.
- Adapter ECE is worse than MNL's on Swissmetro but better on LPMC and IoT-Wearables. On Swissmetro the adapter's uncalibrated ECE sits at 11.1% (Mitra) / 18.6% (TabPFN) against MNL's $\sim$10%; on LPMC and IoT-Wearables the adapter is below MNL ($3.0$ vs $3.8$ and $3.6$ vs $7.9$ for Mitra). The Swissmetro pattern matches the Discussion: the adapter's Swissmetro miscalibration is bias-driven, not inherited from $\mathbf{q}$.
- Raw FMs' low uncalibrated ECE on Swissmetro and LPMC tells us the test-row $\mathbf{q}_i$ values are already well-calibrated when the foundation model has a clean out-of-context prediction. So the adapter's calibration gap on Swissmetro originates in Stage 2's fit of $g$, not in $\mathbf{q}$ itself.
Table 4: Probabilistic-quality metrics on test for three primary models, per (dataset, FM). NLL: negative log-likelihood (lower better). Brier: sum-of-squares Brier score (lower better). unECE (%): uncalibrated Expected Calibration Error with $K=15$ equal-width bins (lower better). MNL and adapter: mean $\pm$ std across $10$ bootstrap replicates; raw FM deterministic.
## Appendix E Cross-fitted training $\mathbf{q}_i$: protocol and impact
The cross-fit is straightforward 5-fold out-of-fold (OOF) prediction. For each fold $f$, we fit a fresh foundation model on the other four training folds plus the full validation set as context, then predict on fold $f$. Concatenating across folds gives a training $\mathbf{q}_i$ where no row was predicted by a model that saw its own label. Test $\mathbf{q}_i$ stays as a single fit on (train+val); test rows are out-of-context anyway. Validation $\mathbf{q}_i$ stays in-sample (it's only used for early stopping and temperature fitting, where the leakage is harmless).
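The protocol can be sketched as below. The function `fit_predict_stub` is a hypothetical stand-in for a foundation model's in-context fit and predict (a nearest-centroid classifier, chosen only so the example runs without external dependencies); the fold assignment by random permutation and the seed are likewise assumptions.

```python
import numpy as np


def fit_predict_stub(X_ctx, y_ctx, X_query, n_classes):
    """Hypothetical stand-in for a foundation model: softmax over negative
    squared distances to class centroids computed from the context rows."""
    centroids = np.stack([
        X_ctx[y_ctx == k].mean(axis=0) if np.any(y_ctx == k) else X_ctx.mean(axis=0)
        for k in range(n_classes)
    ])
    dist = ((X_query[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -dist
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def cross_fit_q(X_train, y_train, X_val, y_val, n_classes, n_folds=5, seed=0):
    """Out-of-fold q for training rows: each row is predicted from a context
    made of the other folds plus the full validation set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X_train)), n_folds)
    q = np.full((len(X_train), n_classes), np.nan)
    for f, held_out in enumerate(folds):
        ctx_idx = np.concatenate([folds[k] for k in range(n_folds) if k != f])
        X_ctx = np.vstack([X_train[ctx_idx], X_val])
        y_ctx = np.concatenate([y_train[ctx_idx], y_val])
        q[held_out] = fit_predict_stub(X_ctx, y_ctx, X_train[held_out], n_classes)
    return q


rng = np.random.default_rng(1)
X_tr, y_tr = rng.normal(size=(100, 4)), rng.integers(0, 3, size=100)
X_va, y_va = rng.normal(size=(20, 4)), rng.integers(0, 3, size=20)
X_te = rng.normal(size=(30, 4))

q_train = cross_fit_q(X_tr, y_tr, X_va, y_va, n_classes=3)
# Test q: a single fit with (train + val) as context, as in the protocol above.
q_test = fit_predict_stub(np.vstack([X_tr, X_va]), np.concatenate([y_tr, y_va]), X_te, 3)
print(q_train.shape, q_test.shape)
```

Initializing `q` with NaN makes it easy to verify afterwards that every training row was filled by exactly one held-out prediction.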
This protocol works on five of the six (dataset, FM) cells. The sixth, TabPFN on LPMC, hit TabPFN's CUDA attention-kernel launch ceiling at the per-fold context size ($\sim 69{,}000$ rows). Every combination we tried failed: RTX 3090 and A100, cu126 and cu128 wheels, tabpfn==2.2.1 with autocast and with bfloat16, with and without forced FlashAttention. We retain the in-sample $\mathbf{q}_i$ from the full-context fit for that cell only.
#### Empirical impact.
The in-sample-vs-cross-fit gap on *training* accuracy is large on Swissmetro ($+9.9$ pp for Mitra, $+19.8$ pp for TabPFN) and small on LPMC and IoT-Wearables ($\leq 1.0$ pp), confirming that in-sample $\mathbf{q}_i$ encodes substantial label memorization on Swissmetro. Despite that, the adapter's *test* accuracy moves by at most $-0.6$ pp under cross-fitting on any cell, and on average just $-0.2$ pp across the five cross-fitted cells. So $g$ wasn't exploiting the leakage as a shortcut; it was learning patterns that generalize.