LLMs on Tabular Data with Limited Semantics: Evidence from Industrial Car Retrofit Prediction
Summary
This paper evaluates LLM-based strategies (embedding, prompt, hybrid) against classical tabular models on an industrial car retrofit prediction dataset with hashed categorical features. It finds that tree ensembles outperform LLMs overall, but embeddings and hybrid approaches remain useful, while direct prompting fails without semantic cues.
# LLMs on Tabular Data with Limited Semantics: Evidence from Industrial Car Retrofit Prediction
Source: [https://arxiv.org/html/2606.15314](https://arxiv.org/html/2606.15314)
Aina Vila Pons (1, 2), Ioannis Tzachristas (1), Constantinos Antoniou (1)

(1) Chair of Transportation Systems Engineering, Technical University of Munich, Germany
(2) BMW Group, Munich, Germany
###### Abstract
Industrial retrofit planning depends on structured operational data rather than free text: planners must estimate whether a newly registered prototype will require a retrofit, which retrofit package it will need, and how long the work will take. We study an industrial dataset linking a prototype-registration system (284,271 vehicles) with a retrofit-management system (48,716 cleaned visits), and compare strong tabular machine learning baselines with three LLM-based strategies on row-serialized inputs: embedding features (Amazon Titan), direct prompted classification (Claude Sonnet 4), and an ML+LLM stacking approach. Across binary occurrence prediction, 15-way retrofit-type classification, per-visit duration regression, and an aggregated monthly benchmark, classical tree ensembles remain the strongest standalone models. However, the LLM results reveal a consistent pattern: embeddings remain useful on tables (binary AUC = 0.982), direct prompting collapses once semantic signal is stripped by hashing (binary AUC = 0.500; multiclass weighted F1 = 0.018), and hybrid stacking yields the best manually built multiclass model (weighted F1 = 0.626). On the monthly benchmark, lag-based machine learning outperforms time-series foundation models, though Chronos-small remains competitive in zero-shot forecasting. The results suggest that on privacy-constrained industrial tables, LLMs are more effective as complementary components than as replacements for strong tabular baselines.
Footnote: This research was conducted as part of the industrial Master’s Thesis of Aina Vila Pons at the Technical University of Munich, in collaboration with BMW Group, under the co-supervision of her BMW manager, Mr. Guido Klöss.

## 1 Introduction
Retrofit departments in automotive development modify prototype vehicles by implementing hardware and software changes so that downstream testing can proceed. Capacity planning is difficult because early vehicle metadata is high-dimensional, mostly categorical, and noisy, while the retrofit demand signal is rare. In practice, planners need answers to three questions: *will* a newly registered prototype visit the retrofit department, *what* retrofit package will it need, and *how long* will the retrofit likely take?
We study that pipeline through the lens of LLMs over structured data. The inputs are tables whose categorical values are hashed, so row serialization exposes structure but very little lexical semantics. That makes the setting useful for separating what LLM-based methods can learn from distributional patterns alone from what they need semantic content to solve. We compare strong tabular baselines against three LLM-based strategies on serialized rows, and we also benchmark time-series foundation models on an aggregated monthly planning signal. (Footnote 1: [https://github.com/aina-vila-pons/retrofit-forecast-pipeline](https://github.com/aina-vila-pons/retrofit-forecast-pipeline).)
#### Contributions\.
- We formulate an end-to-end industrial planning problem over structured data: binary occurrence prediction, 15-way retrofit-type prediction, per-visit duration regression, and a monthly time-series benchmark.
- We compare three LLM-based strategies for tabular prediction (embedding features, prompted classification, and ML+LLM stacking) against strong classical baselines and an AutoML reference.
- We report a clear failure mode: direct prompting on hashed inputs collapses to chance level (binary AUC = 0.500), while embedding-based features and hybrid stacking still retain useful signal.
- We distill deployment lessons on cost, latency, and privacy-constrained model design for LLM systems operating over enterprise tables.
[Figure 1 diagram: prototype registrations (all vehicles) and retrofit visits (department cases) are combined by a left join (label $y \in \{0,1\}$) and an inner join (retrofit-only rows). Stage 1: binary classifier, $\hat{p}(\mathrm{retrofit})$. Stage 2: multiclass classifier, $\hat{y}_{\mathrm{type}}$. Stage 3: duration regressor, $\widehat{d}$. Planning output: type $\times$ duration workload estimate.]

Figure 1: Compact view of the three-stage planning pipeline. LLM methods are evaluated at the classification stages by serializing each structured row into text.
## 2 Related work
#### LLMs for structured data.
Recent work studies whether language-model interfaces can improve prediction on tables. One direction treats a row serialization as text and learns from its embedding representation Dinh et al. ([2022](https://arxiv.org/html/2606.15314#bib.bib18)); Hegselmann et al. ([2023](https://arxiv.org/html/2606.15314#bib.bib19)). A second direction performs direct prompted prediction over serialized rows, optionally with few-shot examples Hegselmann et al. ([2023](https://arxiv.org/html/2606.15314#bib.bib19)). A third direction combines LLM outputs with classical tabular models in hybrid systems. In parallel, tabular foundation models such as TabPFN ask whether pretrained transformers can match boosted trees on structured data Hollmann et al. ([2023](https://arxiv.org/html/2606.15314#bib.bib16)).
#### Industrial tabular ML and temporal forecasting\.
For high-cardinality industrial tables, target encoding Micci-Barreca ([2001](https://arxiv.org/html/2606.15314#bib.bib1)) and gradient-boosted trees such as XGBoost, LightGBM, and CatBoost remain strong baselines Chen and Guestrin ([2016](https://arxiv.org/html/2606.15314#bib.bib4)); Ke et al. ([2017](https://arxiv.org/html/2606.15314#bib.bib5)); Prokhorenkova et al. ([2018](https://arxiv.org/html/2606.15314#bib.bib6)). Rare-event prediction further requires imbalance-aware training and careful metric selection Saito and Rehmsmeier ([2015](https://arxiv.org/html/2606.15314#bib.bib7)); Chawla et al. ([2002](https://arxiv.org/html/2606.15314#bib.bib2)). For the temporal side, we compare classical statistical forecasting Box and Jenkins ([1990](https://arxiv.org/html/2606.15314#bib.bib8)); Hyndman and Athanasopoulos ([2021](https://arxiv.org/html/2606.15314#bib.bib9)); Taylor and Letham ([2018](https://arxiv.org/html/2606.15314#bib.bib10)) with recent time-series foundation models including Chronos and TIME-LLM Ansari et al. ([2024](https://arxiv.org/html/2606.15314#bib.bib11)); Jin et al. ([2024](https://arxiv.org/html/2606.15314#bib.bib12)).
## 3 Data and problem formulation
### 3.1 Data sources
We use two internal systems from a large automotive manufacturer: (1) a *prototype-registration system* that records all newly built prototype vehicles with early metadata such as derivative, integration step, build phase, engine type, build location, and deadlines; and (2) a *retrofit-management system* that records retrofit visits with retrofit package labels and process timestamps. All categorical values are hashed; continuous features used in this work are engineered from timestamps, fleet counts, and frequency statistics. For LLM-based methods, each row is additionally serialized into a key–value description that can be embedded or passed to a prompted classifier.
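The key–value serialization can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `serialize_row`, the `key: value` separator, the sorted key order, and the `unknown` placeholder for missing values are all assumptions, as are the column names and hash strings in the example row.

```python
def serialize_row(row: dict) -> str:
    """Render one structured row as a 'key: value' text string.

    Keys are sorted so the same row always yields the same string.
    Missing values are written as 'unknown' (an assumption of this sketch).
    """
    parts = []
    for key in sorted(row):
        value = row[key]
        parts.append(f"{key}: {'unknown' if value is None else value}")
    return "; ".join(parts)


# Hypothetical row: categorical values are opaque hashes, numeric ones are engineered.
example = {
    "derivative": "a3f9c1",
    "build_phase": "7be02d",
    "engine_type": None,
    "vehicle_age_months": 14,
}
text = serialize_row(example)
print(text)
```

Because the categorical values are hashes, the resulting string carries field names and co-occurrence structure but almost no lexical meaning, which is the property the study exploits.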
### 3.2 Joins and labels
We derive two supervised datasets. A left join produces the binary occurrence dataset ($n = 284{,}271$), where the target indicates whether the vehicle ever visits the retrofit department (positive rate approximately 4.4%; 12,378 positives vs. 271,893 negatives). An inner join produces the retrofit-only dataset ($n = 54{,}174$ visits before duration filtering), used for retrofit-type prediction and duration regression. After grouping 17 rare multi-type combinations into their rarest component, the retrofit-type task contains 15 classes. For duration, we remove P1–P99 outliers (503 rows), leaving 52,507 visits. A compact dataset overview is given in Appendix [A](https://arxiv.org/html/2606.15314#A1).
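The two joins can be illustrated with a small stdlib-only sketch. The key name `vehicle_id`, the function name `build_datasets`, and the toy records are hypothetical; the real systems and their schemas are not described in this paper beyond the fields listed above.

```python
def build_datasets(registrations, visits):
    """Return (binary_rows, retrofit_rows) from two lists of dicts keyed by 'vehicle_id'."""
    visits_by_vehicle = {}
    for visit in visits:
        visits_by_vehicle.setdefault(visit["vehicle_id"], []).append(visit)

    # Left join: every registered vehicle is kept; y = 1 if it has at least one visit.
    binary_rows = [
        {**reg, "y": int(reg["vehicle_id"] in visits_by_vehicle)}
        for reg in registrations
    ]

    # Inner join: one row per visit, only for vehicles present in both systems.
    retrofit_rows = [
        {**reg, **visit}
        for reg in registrations
        for visit in visits_by_vehicle.get(reg["vehicle_id"], [])
    ]
    return binary_rows, retrofit_rows


registrations = [{"vehicle_id": 1}, {"vehicle_id": 2}, {"vehicle_id": 3}]
visits = [
    {"vehicle_id": 2, "retrofit_type": "t07"},
    {"vehicle_id": 2, "retrofit_type": "t11"},
    {"vehicle_id": 4, "retrofit_type": "t02"},  # no matching registration: dropped
]
binary_rows, retrofit_rows = build_datasets(registrations, visits)
print([row["y"] for row in binary_rows], len(retrofit_rows))
```

The left join keeps one row per vehicle, while the inner join keeps one row per visit, which is why the two datasets have different sizes.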
#### Targets\.
- *Stage 1: occurrence.* Given a registration row $x$, predict whether the vehicle will come for retrofit.
- *Stage 2: retrofit type.* For retrofit visits, predict the retrofit-type label.
- *Stage 3: duration.* Predict the number of days between retrofit start and end date. The raw duration distribution is right-skewed (mean = 9.2 days, median = 4.0 days), so we apply a $\log(1+x)$ transform for modeling.
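A short sketch of the $\log(1+x)$ transform and its inverse, using toy durations that are not taken from the dataset. The helper names are hypothetical. The example also shows a general property worth keeping in mind: averaging in log space and transforming back gives a value below the arithmetic mean on right-skewed data.

```python
import math


def to_model_space(days: float) -> float:
    """Forward transform applied to the duration target: log(1 + x)."""
    return math.log1p(days)


def to_days(value: float) -> float:
    """Inverse transform: exp(v) - 1, mapping model output back to days."""
    return math.expm1(value)


# Toy right-skewed durations in days (illustrative only).
durations = [1.0, 2.0, 4.0, 30.0]
arithmetic_mean = sum(durations) / len(durations)
mean_in_log_space = sum(to_model_space(d) for d in durations) / len(durations)
back_transformed = to_days(mean_in_log_space)

print(arithmetic_mean, round(back_transformed, 2))
```

Errors such as MAE are therefore best computed after mapping predictions back to days, so that they are comparable with the raw-scale numbers reported in the results.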
#### Aggregated monthly benchmark\.
In addition to the per\-visit duration task, we construct a 76\-point monthly time series of mean retrofit duration\. This benchmark provides a coarse planning signal and a deliberately low\-data setting for time\-series foundation models\. It also lets us compare vehicle\-level supervision against a much smaller aggregated forecasting problem\.
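The aggregation into a monthly series of mean durations can be sketched as below. Assigning each visit to the calendar month of its start date is an assumption of this sketch; the paper does not state which timestamp defines the month.

```python
from collections import defaultdict
from datetime import date


def monthly_mean_duration(visits):
    """visits: iterable of (start_date, duration_days) pairs.

    Returns a chronologically sorted list of ('YYYY-MM', mean_duration) tuples.
    """
    buckets = defaultdict(list)
    for start, duration in visits:
        buckets[start.strftime("%Y-%m")].append(duration)
    return [(month, sum(v) / len(v)) for month, v in sorted(buckets.items())]


# Toy visits (illustrative only).
toy_visits = [
    (date(2024, 2, 1), 3),
    (date(2024, 1, 5), 4),
    (date(2024, 1, 20), 8),
]
series = monthly_mean_duration(toy_visits)
print(series)
```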
## 4 Methods
Figure [2](https://arxiv.org/html/2606.15314#S4.F2) summarizes the complete modeling pipeline. Starting from the two operational databases, the workflow proceeds through data cleansing, feature engineering, and three supervised learning stages. Each classification stage is evaluated with both classical machine learning models and LLM-based alternatives, while AutoML and time-series foundation models provide additional reference points.

Figure 2: Complete modeling pipeline: from data ingestion through classical machine learning, LLM-based methods, and time-series foundation-model benchmarks to final comparison and forecast chaining.

### 4.1 Feature engineering and classical baselines
We distinguish categorical metadata from engineered numeric signals. Numeric features include vehicle age in months, fleet-size counts aggregated by derivative, base type, engine type, and build location, as well as interaction terms such as age $\times$ fleet and priority $\times$ fleet. To mitigate leakage, we exclude direct identifiers and temporal fields that would trivially reveal the target. After target encoding, frequency encoding, and importance-based pre-filtering, the binary task retains 44 features and the multiclass/temporal tasks retain 50 features each.
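Frequency encoding and target encoding for hashed categories can be sketched as follows. The target encoder here uses a simple additive (m-estimate) smoothing toward the global mean, which is a common simplification and not necessarily the variant used in this work; the `smoothing` parameter and function names are assumptions.

```python
from collections import Counter, defaultdict


def frequency_encode(values):
    """Map each category to its relative frequency in the training column."""
    counts = Counter(values)
    n = len(values)
    return {cat: c / n for cat, c in counts.items()}


def target_encode(values, targets, smoothing=10.0):
    """Map each category to a smoothed mean target.

    encoding = (sum_y + smoothing * prior) / (count + smoothing),
    so rare categories are pulled toward the global mean (the prior).
    """
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = Counter()
    for cat, y in zip(values, targets):
        sums[cat] += y
        counts[cat] += 1
    mapping = {
        cat: (sums[cat] + smoothing * prior) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, prior


# Toy hashed column and binary target (illustrative only).
column = ["a3f9", "a3f9", "7be0", "c21d"]
target = [1, 0, 1, 0]
freq = frequency_encode(column)
enc, prior = target_encode(column, target, smoothing=2.0)
print(freq, enc, prior)
```

Both mappings should be fitted on the training split only and then applied to held-out rows (with the prior as the fallback for unseen categories); fitting them on the full data would leak the target.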
Across the three stages we evaluate logistic regression, Random Forest Breiman ([2001](https://arxiv.org/html/2606.15314#bib.bib3)), Extra Trees, Gradient Boosting, XGBoost, LightGBM, and CatBoost Chen and Guestrin ([2016](https://arxiv.org/html/2606.15314#bib.bib4)); Ke et al. ([2017](https://arxiv.org/html/2606.15314#bib.bib5)); Prokhorenkova et al. ([2018](https://arxiv.org/html/2606.15314#bib.bib6)). For Stage 1 we also train stacking and weighted soft-voting ensembles from the top-performing models. Hyperparameters are tuned with Optuna Akiba et al. ([2019](https://arxiv.org/html/2606.15314#bib.bib20)); for the multiclass task, the SMOTE ratio is tuned jointly with model hyperparameters Chawla et al. ([2002](https://arxiv.org/html/2606.15314#bib.bib2)).
### 4.2 LLM-based methods
We evaluate three LLM\-based strategies at the classification stages\.
#### LLM\-Embed\.
Each structured row is serialized into a key–value text string and embedded with Amazon Titan Embed v2 \(1,024 dimensions\)\. A logistic regression classifier is trained on the resulting vectors\. Because of API cost, the embedding stage is subsampled to 5,000 training rows\.
#### LLM\-Prompted\.
Claude Sonnet 4 receives each serialized row together with task\-specific few\-shot examples and predicts the label directly\. We use direct prompted classification rather than fine\-tuning\. Because of cost and latency, prompted inference is evaluated on smaller held\-out subsets \(200–500 rows depending on task\)\.
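The structure of a few-shot classification prompt over serialized rows might look like the sketch below. The prompt wording, layout, label names, and the `build_prompt` helper are hypothetical; the actual prompt used in this work is not reproduced here, and no API call is made.

```python
def build_prompt(task, labels, examples, query_row_text):
    """Assemble a few-shot prompt: task description, allowed labels,
    labeled example rows, then the unlabeled query row."""
    lines = [task, "Allowed labels: " + ", ".join(labels), ""]
    for text, label in examples:
        lines += [f"Row: {text}", f"Label: {label}", ""]
    lines += [f"Row: {query_row_text}", "Label:"]
    return "\n".join(lines)


# Hypothetical serialized rows with hashed categorical values.
few_shot = [
    ("build_phase: 7be02d; derivative: a3f9c1", "retrofit"),
    ("build_phase: 11c4aa; derivative: 90d2ef", "no_retrofit"),
]
prompt = build_prompt(
    task="Predict whether the vehicle will visit the retrofit department.",
    labels=["retrofit", "no_retrofit"],
    examples=few_shot,
    query_row_text="build_phase: 7be02d; derivative: 55b0c7",
)
print(prompt)
```

With hashed values, the only information the model can use is exact-match overlap between the query row and the handful of examples, which helps explain why this strategy is sensitive to the loss of semantic content.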
#### LLM\-Stacked\.
A meta\-learner combines the best classical probabilities with LLM\-derived outputs\. This design tests whether LLMs are more useful as feature providers or complementary components than as end\-to\-end predictors\.
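A stacking meta-learner of this kind can be sketched with a small logistic regression trained on base-model outputs. This is an illustration under stated assumptions: the paper does not specify the meta-learner type, so logistic regression fitted by batch gradient descent is a stand-in, and the two input columns (a tabular-model probability and an LLM-derived probability) and all toy values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_learner(features, labels, lr=0.5, epochs=2000):
    """Logistic regression by batch gradient descent on base-model outputs."""
    n_features = len(features[0])
    n = len(features)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(weights, bias, x):
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Each row: [p_tabular, p_llm_embed] for one vehicle (toy values).
base_outputs = [[0.9, 0.8], [0.8, 0.3], [0.7, 0.9], [0.2, 0.1], [0.3, 0.6], [0.1, 0.2]]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = fit_meta_learner(base_outputs, labels)
stacked = [predict_proba(weights, bias, x) for x in base_outputs]
print(weights, bias)
```

In a real pipeline the meta-learner would normally be trained on out-of-fold base predictions rather than on predictions for rows the base models were fitted on, to avoid an optimistic bias.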
### 4.3 AutoML and time-series baselines
AutoGluon serves as an AutoML reference across the tabular tasks Erickson et al. ([2020](https://arxiv.org/html/2606.15314#bib.bib17)). For the monthly benchmark, we compare statistical baselines, machine learning on lag features, AutoGluon on lag features, and time-series foundation models including Chronos and TIME-LLM Ansari et al. ([2024](https://arxiv.org/html/2606.15314#bib.bib11)); Jin et al. ([2024](https://arxiv.org/html/2606.15314#bib.bib12)). The benchmark is intentionally small and sparse, making it a useful stress test for zero-shot forecasting.
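Turning a monthly series into a supervised lag-feature problem can be sketched as below. The number of lags (3) is an assumption, since the paper does not state it; the 76-point length and 16-month holdout follow the text. Test rows here use the true preceding values as lags (one-step-ahead evaluation), which is also an assumption.

```python
def make_lag_rows(series, n_lags):
    """Each row is (previous n_lags values, value to predict)."""
    return [(series[t - n_lags:t], series[t]) for t in range(n_lags, len(series))]


def split_by_holdout(series, n_lags, holdout):
    """Keep the last `holdout` targets for testing, the rest for training."""
    rows = make_lag_rows(series, n_lags)
    return rows[:-holdout], rows[-holdout:]


# Placeholder series with the same length as the monthly benchmark.
series = [float(i) for i in range(76)]
train_rows, test_rows = split_by_holdout(series, n_lags=3, holdout=16)
print(len(train_rows), len(test_rows), test_rows[0])
```

With only a few dozen training rows after lagging, any regressor fitted on these features operates in a very low-data regime, which is what makes the benchmark a stress test.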
## 5 Experimental setup
We use an 80/20 stratified train–test split for classification and regression, and a 16-month holdout for the monthly time series. We additionally compute bootstrap confidence intervals and pairwise McNemar tests; the main text reports the core performance tables, while detailed confidence intervals for the binary task are retained in Appendix [B](https://arxiv.org/html/2606.15314#A2).
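A pairwise McNemar test compares two classifiers on the same test rows using only the rows where exactly one of them is correct. The sketch below uses the exact binomial form; whether the study used the exact or the chi-square version is not stated, so this choice is an assumption, as are the toy predictions.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    b = rows where A is right and B is wrong; c = the reverse.
    Under H0 the discordant rows split 50/50, so b ~ Binomial(b + c, 0.5).
    Returns (b, c, p_value).
    """
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy paired predictions (illustrative only).
y_true = [1] * 10
model_a = [1] * 10
model_b = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
b, c, p_value = mcnemar_exact(y_true, model_a, model_b)
print(b, c, p_value)
```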
Evaluation metrics are ROC-AUC, PR-AUC, and thresholded F1 for Stage 1; accuracy, weighted F1, macro F1, and one-vs-rest AUC for Stage 2; and MAE, RMSE, and $R^2$ for Stage 3. Because Stage 1 is heavily imbalanced, ROC-AUC is interpreted jointly with PR-AUC Saito and Rehmsmeier ([2015](https://arxiv.org/html/2606.15314#bib.bib7)). For the monthly benchmark we report MAE and $R^2$ on the 16-month test window.
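To see why the two metrics are read together, the sketch below computes ROC-AUC (as a pairwise ranking probability) and a step-wise PR-AUC (average precision) on toy data. For a scorer that gives every row the same score, ROC-AUC is 0.5 and average precision equals the positive rate. The functions are simplified stand-ins written for this illustration, not the evaluation code of the study.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve; tied scores share one threshold."""
    n_pos = sum(y_true)
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    ap, tp, prev_recall, i = 0.0, 0, 0.0, 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            j += 1
        recall = tp / n_pos
        ap += (recall - prev_recall) * (tp / j)  # j rows are predicted positive so far
        prev_recall = recall
        i = j
    return ap


# Toy imbalanced labels: 1 positive in 20 rows (5% positive rate).
y = [1] + [0] * 19
constant = [0.5] * 20           # uninformative scorer
perfect = [0.9] + [0.1] * 19    # ideal ranking

print(roc_auc(y, constant), average_precision(y, constant))
print(roc_auc(y, perfect), average_precision(y, perfect))
```

This is consistent with the pattern reported for direct prompting in Stage 1, where ROC-AUC is 0.500 and PR-AUC (0.045) sits close to the roughly 4.4% positive rate.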
## 6 Results
### 6.1 Stage 1: occurrence prediction
[Table 1](https://arxiv.org/html/2606.15314#S6.T1) shows the main binary results. Hyperparameter-tuned CatBoost achieves the strongest standalone performance (AUC = 0.997, F1 = 0.884), while the voting ensemble yields the best PR-AUC. Among the LLM-based methods, embeddings remain useful on rows (AUC = 0.982), but direct prompting collapses to random performance (AUC = 0.500). The hybrid stack attains F1 = 0.900 on its 200-row evaluation subset, suggesting complementary signal, but the overall picture remains clear: strong tabular models dominate as standalone predictors.

Table 1: Stage 1 binary occurrence prediction. † indicates the smaller 200-row LLM evaluation subset. Full table in Appendix [B](https://arxiv.org/html/2606.15314#A2).

| Model | ROC-AUC | PR-AUC | F1 |
| --- | --- | --- | --- |
| CatBoost (HP-tuned) | 0.997 | 0.932 | 0.884 |
| Voting ensemble | 0.997 | 0.937 | 0.883 |
| LLM-Stacked† | 0.996 | 0.897 | 0.900 |
| LLM-Embed | 0.982 | 0.667 | 0.684 |
| LLM-Prompted† | 0.500 | 0.045 | 0.086 |
| AutoGluon | 0.997 | — | 0.881 |
### 6.2 Stage 2: retrofit-type prediction
The multiclass task provides the clearest positive result for hybrid LLM use. As shown in [Table 2](https://arxiv.org/html/2606.15314#S6.T2), LLM-Stacked reaches the best weighted F1 among the manually configured pipelines (0.626), slightly ahead of Random Forest with SMOTE (0.621) and XGBoost (0.614). AutoGluon remains best overall at 0.654 weighted F1. Again, direct prompted classification performs poorly on the inputs, while embedding-based features remain serviceable but clearly below the top tabular baselines.

Table 2: Stage 2 retrofit-type prediction. F1w = weighted F1, F1m = macro F1. Full table in Appendix [B](https://arxiv.org/html/2606.15314#A2).

| Model | Acc | F1w | F1m |
| --- | --- | --- | --- |
| AutoGluon | 0.702 | 0.654 | 0.348 |
| LLM-Stacked | 0.680 | 0.626 | 0.281 |
| Random Forest (SMOTE) | 0.611 | 0.621 | 0.390 |
| XGBoost (SMOTE) | 0.601 | 0.614 | 0.380 |
| LLM-Embed | 0.480 | 0.521 | 0.327 |
| LLM-Prompted | 0.100 | 0.018 | 0.012 |
### 6.3 Stage 3: duration prediction and monthly benchmark
For per-visit duration, classical ensembles still dominate: AutoGluon achieves MAE = 4.9 days and Extra Trees achieves MAE = 5.1 days. On the monthly series, lag-based machine learning remains best (LightGBM and AutoGluon both reach MAE = 3.16), but Chronos-small is reasonably competitive at MAE = 4.03 without task-specific training. TIME-LLM performs worse than Chronos but remains in the range of simple statistical baselines. These results suggest that for the temporal component, foundation models are promising complements but not yet replacements for compact supervised baselines.

Table 3: Stage 3 duration results. The full monthly benchmark is in Appendix [C](https://arxiv.org/html/2606.15314#A3).

| Setting | Model | MAE | $R^2$ |
| --- | --- | --- | --- |
| Per-visit | AutoGluon | 4.9 | 0.38 |
| Per-visit | Extra Trees | 5.1 | 0.37 |
| Monthly | LightGBM (lags) | 3.16 | -1.90 |
| Monthly | AutoGluon (lags) | 3.16 | -1.88 |
| Monthly | Chronos-small | 4.03 | -0.07 |
| Monthly | TIME-LLM | 4.62 | -0.22 |
## 7 Discussion and deployment considerations
#### Operating point selection\.
The strongest Stage 1 models achieve ROC-AUC values of at least 0.996, but operational utility depends on the chosen threshold. In practice, false positives waste effort by triggering unnecessary preparation, while false negatives create capacity risk. We therefore recommend selecting operating points using precision–recall trade-offs rather than treating ROC-AUC as sufficient on its own.
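One way to turn this recommendation into a procedure is to fix a precision floor and pick the threshold that maximizes recall subject to it. The sketch below is illustrative: the precision floor, the helper name `pick_threshold`, and the toy scores are assumptions, and the selection would normally be done on a validation split.

```python
def pick_threshold(y_true, scores, min_precision):
    """Among thresholds whose precision meets the floor, return the one with
    the highest recall as (threshold, precision, recall), or None if none qualifies."""
    best = None
    n_pos = sum(y_true)
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]
        tp = sum(1 for f, y in zip(flagged, y_true) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, y_true) if f and y == 0)
        precision = tp / (tp + fp)  # at least one row is always flagged
        recall = tp / n_pos
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


# Toy validation scores (illustrative only).
y_val = [1, 1, 0, 1, 0, 0]
p_val = [0.95, 0.9, 0.8, 0.7, 0.3, 0.1]
operating_point = pick_threshold(y_val, p_val, min_precision=0.75)
print(operating_point)
```

Raising the precision floor reduces unnecessary preparation work at the cost of missed retrofits, so the floor itself is a planning decision rather than a modeling one.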
#### LLM integration trade\-offs\.
The LLM results show a clear boundary\. Once categorical values are hashed, direct prompted classification loses the semantic cues that make language\-model reasoning useful\. By contrast, row embeddings still preserve co\-occurrence and frequency structure, and the stacking model can exploit complementary error patterns\. For practitioners, that means LLMs are currently more promising as components inside a broader tabular system than as end\-to\-end predictors on enterprise tables\.
The LLM components also impose real deployment costs\. Embedding 5,000 rows required minutes of API time, and prompted classification on a few hundred rows was slower still\. When target encoding and boosted trees already perform strongly, those costs need a clear return\. In this dataset, the strongest justification for LLM integration is complementary signal rather than raw leaderboard improvement\.
#### Temporal forecasting and drift\.
Chronos\-small and TIME\-LLM provide usable forecasts without task\-specific training, but the best lag\-based models remain stronger on this short series\. Prototype programs also change quickly: new derivatives, build phases, and retrofit types appear regularly\. That makes drift monitoring and periodic retraining necessary regardless of whether the downstream model is classical or foundation\-based\.
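A simple drift signal for hashed categorical columns is the share of incoming rows whose category was never seen during training. The sketch below is one possible monitor, not a procedure described in this work; the function name and toy values are hypothetical, and any alert threshold would need to be chosen per column.

```python
def unseen_rate(train_values, new_values):
    """Fraction of new rows whose category did not appear in the training data."""
    known = set(train_values)
    new_values = list(new_values)
    if not new_values:
        return 0.0
    return sum(value not in known for value in new_values) / len(new_values)


# Toy hashed derivative codes (illustrative only).
train_derivatives = ["a3f9", "7be0", "a3f9"]
incoming_derivatives = ["a3f9", "c21d", "e880", "7be0"]
rate = unseen_rate(train_derivatives, incoming_derivatives)
print(rate)
```

A rising unseen rate means target and frequency encodings fall back to default values more often, which is a practical trigger for retraining.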
## 8 Conclusion
We presented a multi\-stage industrial planning study over enterprise tables\. Across tasks, the main pattern is consistent: on privacy\-constrained structured data, direct prompted classification is brittle, embedding features remain useful, and hybrid ML\+LLM systems can improve selected tasks without displacing strong tabular baselines\. Taken together, the results suggest that current LLM methods are most effective as complementary components for structured industrial prediction\.
## References
- T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama (2019). Optuna: a next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, New York, NY, USA, pp. 2623–2631. ISBN 9781450362016. [Link](https://doi.org/10.1145/3292500.3330701)
- A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram, S. P. Arango, S. Kapoor, J. Zschiegner, D. C. Maddix, H. Wang, M. W. Mahoney, K. Torkkola, A. G. Wilson, M. Bohlke-Schneider, and Y. Wang (2024). Chronos: learning the language of time series. arXiv 2403.07815. [Link](https://arxiv.org/abs/2403.07815)
- G. E. P. Box and G. Jenkins (1990). Time series analysis, forecasting and control. Holden-Day, Inc., USA. ISBN 0816211043.
- L. Breiman (2001). Random forests. Mach. Learn. 45(1), pp. 5–32. ISSN 0885-6125. [Link](https://doi.org/10.1023/A:1010933404324)
- N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002). SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16(1), pp. 321–357. ISSN 1076-9757.
- T. Chen and C. Guestrin (2016). XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 785–794. ISBN 9781450342322. [Link](https://doi.org/10.1145/2939672.2939785)
- T. Dinh, Y. Zeng, R. Zhang, Z. Lin, M. Gira, S. Rajput, J. Sohn, D. Papailiopoulos, and K. Lee (2022). LIFT: language-interfaced fine-tuning for non-language machine learning tasks. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA. ISBN 9781713871088.
- N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola (2020). AutoGluon-tabular: robust and accurate automl for structured data. arXiv 2003.06505. [Link](https://arxiv.org/abs/2003.06505)
- S. Hegselmann, A. Buendia, H. Lang, M. Agrawal, X. Jiang, and D. Sontag (2023). TabLLM: few-shot classification of tabular data with large language models. arXiv 2210.10723. [Link](https://arxiv.org/abs/2210.10723)
- N. Hollmann, S. Müller, K. Eggensperger, and F. Hutter (2023). TabPFN: a transformer that solves small tabular classification problems in a second. In International Conference on Learning Representations (ICLR).
- R. J. Hyndman and G. Athanasopoulos (2021). Forecasting: principles and practice. 3rd edition, OTexts. [Link](https://otexts.com/fpp3/)
- M. Jin, S. Wang, L. Ma, Z. Chu, J. Y. Zhang, X. Shi, P. Chen, Y. Liang, Y. Li, S. Pan, and Q. Wen (2024). Time-LLM: time series forecasting by reprogramming large language models. In International Conference on Learning Representations (ICLR).
- G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu (2017). LightGBM: a highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS ’17, Red Hook, NY, USA, pp. 3149–3157. ISBN 9781510860964.
- D. Micci-Barreca (2001). A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. SIGKDD Explor. Newsl. 3(1), pp. 27–32. ISSN 1931-0145. [Link](https://doi.org/10.1145/507533.507538)
- L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin (2018). CatBoost: unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS ’18, Red Hook, NY, USA, pp. 6639–6649.
- T. Saito and M. Rehmsmeier (2015). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 10(3), pp. e0118432. [Link](https://dx.doi.org/10.1371/journal.pone.0118432)
- S. J. Taylor and B. Letham (2018). Forecasting at scale. The American Statistician 72(1), pp. 37–45. [Link](https://dx.doi.org/10.1080/00031305.2017.1380080)
## Appendix A Additional dataset overview
Table 4: Dataset overview and derived learning tasks.

| Dataset / task | Rows | Vehicles |
| --- | --- | --- |
| Registration system (raw) | 284,271 | 284,271 |
| Retrofit system (cleaned) | 48,716 | 11,830 |
| Stage 1: binary (left join) | 284,271 | 284,271 |
| Stage 2: multiclass (inner join) | 54,174 | — |
| Stage 3: duration (inner join subset) | 52,507 | — |
## Appendix B Full classification tables
Tables [5](https://arxiv.org/html/2606.15314#A2.T5) and [6](https://arxiv.org/html/2606.15314#A2.T6) provide the complete binary and multiclass comparisons, including the stronger manual baselines omitted from the main text for space.

Table 5: Full binary occurrence results (Stage 1). F1 is reported at the optimal threshold.

| Model / approach | ROC-AUC | PR-AUC | F1@opt | AUC CI |
| --- | --- | --- | --- | --- |
| CatBoost (HP-tuned) | 0.997 | 0.932 | 0.884 | [0.997, 0.997] |
| Voting ensemble (top-3) | 0.997 | 0.937 | 0.883 | [0.997, 0.998] |
| Extra Trees (SMOTE) | 0.997 | 0.935 | 0.880 | [0.997, 0.998] |
| XGBoost (HP-tuned) | 0.996 | 0.913 | 0.884 | [0.996, 0.997] |
| LightGBM (HP-tuned) | 0.996 | 0.910 | 0.883 | [0.996, 0.997] |
| LLM-Stacked | 0.996 | 0.897 | 0.900 | [0.987, 1.000] |
| Random Forest (HP-tuned) | 0.997 | 0.929 | 0.879 | [0.996, 0.997] |
| LogReg (feat-selected) | 0.990 | 0.810 | 0.777 | [0.989, 0.991] |
| LLM-Embed LogReg | 0.982 | 0.667 | 0.684 | [0.977, 0.986] |
| AutoGluon (Ens. L3) | 0.997 | — | 0.881 | — |
| LLM-Prompted (Sonnet 4) | 0.500 | 0.045 | 0.086 | [0.500, 0.500] |

Table 6: Full multiclass retrofit-type results (Stage 2). F1w = weighted F1, F1m = macro F1, AUCovr = one-vs-rest AUC.

| Model | Acc | F1w | F1m | AUCovr |
| --- | --- | --- | --- | --- |
| AutoGluon (Ens. L3) | 0.702 | 0.654 | 0.348 | 0.934 |
| LLM-Stacked | 0.680 | 0.626 | 0.281 | 0.888 |
| Random Forest (SMOTE) | 0.611 | 0.621 | 0.390 | 0.917 |
| XGBoost (SMOTE) | 0.601 | 0.614 | 0.380 | 0.914 |
| LightGBM (SMOTE) | 0.582 | 0.605 | 0.387 | 0.917 |
| CatBoost (SMOTE) | 0.589 | 0.600 | 0.356 | 0.906 |
| LLM-Embed LogReg | 0.480 | 0.521 | 0.327 | 0.869 |
| Logistic Regression (SMOTE) | 0.293 | 0.332 | 0.148 | 0.768 |
| LLM-Prompted (Sonnet 4) | 0.100 | 0.018 | 0.012 | 0.500 |
## Appendix C Monthly benchmark details
Table [7](https://arxiv.org/html/2606.15314#A3.T7) reports the full monthly benchmark, including statistical baselines and the complete lag-feature comparison.

Table 7: Aggregated monthly time-series benchmark for retrofit duration. 76 observations (60 train, 16 test).

| Family | Model | MAE | RMSE | $R^2$ |
| --- | --- | --- | --- | --- |
| *Statistical baselines* | SMA(3) | 4.16 | 4.87 | 0.002 |
| | SMA(6) | 4.18 | 4.87 | 0.004 |
| | SES | 4.20 | 4.88 | -0.001 |
| | SBA | 4.20 | 4.88 | -0.002 |
| | ARIMA(1,1,1) | 4.19 | 4.91 | -0.014 |
| | Croston | 4.22 | 4.93 | -0.019 |
| | ETS (additive) | 4.24 | 5.82 | -0.424 |
| *ML on lag features* | LightGBM | 3.16 | 3.70 | -1.90 |
| | AutoGluon | 3.16 | 3.79 | -1.88 |
| | SGD | 3.17 | 3.74 | -1.74 |
| | Ridge | 3.22 | 3.80 | -1.78 |
| | XGBoost | 3.33 | 4.00 | -2.14 |
| | MLP | 3.39 | 3.90 | -2.11 |
| | Random Forest | 3.44 | 3.99 | -2.30 |
| | LinearReg | 3.48 | 4.11 | -2.20 |
| *Foundation models* | Chronos-small | 4.03 | 5.04 | -0.065 |
| | Chronos-tiny | 4.14 | 5.10 | -0.091 |
| | Chronos-base | 4.25 | 5.32 | -0.187 |
| | TIME-LLM (frozen GPT-2) | 4.62 | 5.56 | -0.217 |
| *Ensemble* | LightGBM + AutoGluon + SGD | 3.16 | — | -1.84 |