ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder


Summary

The article introduces ASD-Bench, a comprehensive benchmark evaluating AI models for Autism Spectrum Disorder screening across four axes: predictive performance, calibration, interpretability, and robustness. It analyzes various models across different age cohorts using AQ-10 data, highlighting the importance of multi-metric evaluation in clinical AI applications.

arXiv:2605.11091v1 Announce Type: new Abstract: Automated ASD screening tools remain limited by single-architecture evaluations, axis-restricted assessment, and near-exclusive focus on adult cohorts, obscuring age-specific diagnostic patterns critical for early intervention. We introduce ASD-Bench, a systematic tabular benchmark evaluating ML, deep learning, and foundation model configurations across three age cohorts (children 1-11 yr, adolescents 12-16 yr, adults 17-64 yr) on four axes: predictive performance, calibration, interpretability, and adversarial robustness. Applied to a curated v3 dataset of 4,068 AQ-10 records, our benchmark spans classical models (XGBoost, AdaBoost, Random Forest, Logistic Regression), neural networks (MLP), deep tabular transformers (TabNet, TabTransformer, FT-Transformer), and TabPFN v2. We introduce the Heuristic Aggregate Penalty (HAP): a cost-sensitive metric penalising false negatives more heavily and incorporating cross-validation variance for deployment stability. Adult classification yields high performance (10/17 models achieve perfect F1 and AUC), while adolescents present a harder task (F1 ceiling 0.837 vs. 0.915 for children). Feature hierarchies shift across cohorts: A9 (social motivation) dominates for children, A5 (pattern recognition) leads for adolescents, and adults exhibit a flatter importance profile consistent with developmental social masking. Accuracy and calibration are dissociated: AdaBoost achieves F1=1.000 on adults with ECE=0.302, confirming single-metric evaluation is insufficient for clinical AI. Cohort-specific deployment recommendations are provided. All findings should be interpreted as proof-of-concept evidence on questionnaire-derived labels rather than clinically validated diagnostic performance.

Cached at: 05/13/26, 06:29 AM

Source: [https://arxiv.org/html/2605.11091](https://arxiv.org/html/2605.11091)

ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder

Shubhankit Singh (1, ∗), Hassan Shaikh (1, 2, †), Kuldeep Raghuwanshi (1, 3, †), Keshav Bulia (1, 2, †)

(1) Research Commons AI, (2) IIT Bombay, (3) IIT Delhi

∗ Corresponding author. † Equal contribution. shubhankitsingh@researchcommons.ai

Keywords: Autism Spectrum Disorder, AQ-10, clinical AI, tabular benchmark, AI/ML models

###### Abstract

Automated ASD screening tools remain limited by single-architecture evaluations, axis-restricted assessment, and near-exclusive focus on adult cohorts, obscuring age-specific diagnostic patterns critical for early intervention. We introduce ASD-Bench, a systematic tabular benchmark evaluating ML, deep learning, and foundation model configurations across three age cohorts (children 1–11 yr, adolescents 12–16 yr, adults 17–64 yr) on four axes: predictive performance, calibration, interpretability, and adversarial robustness. Applied to a curated v3 dataset of 4,068 AQ-10 records, our benchmark spans classical models (XGBoost, AdaBoost, Random Forest, Logistic Regression), neural networks (MLP), deep tabular transformers (TabNet, TabTransformer, FT-Transformer), and TabPFN v2. We introduce the Heuristic Aggregate Penalty (HAP): a cost-sensitive metric penalising false negatives more heavily and incorporating cross-validation variance for deployment stability. Adult classification yields high performance (10/17 models achieve perfect F1 and AUC), while adolescents present a harder task (F1 ceiling 0.837 vs. 0.915 for children). Feature hierarchies shift across cohorts: A9 (social motivation) dominates for children, A5 (pattern recognition) leads for adolescents, and adults exhibit a flatter importance profile consistent with developmental social masking. Accuracy and calibration are dissociated: AdaBoost achieves F1 = 1.000 on adults with ECE = 0.302, confirming single-metric evaluation is insufficient for clinical AI. Cohort-specific deployment recommendations are provided. All findings should be interpreted as proof-of-concept evidence on questionnaire-derived labels rather than clinically validated diagnostic performance.

## 1 Introduction

According to the WHO Report 2021, one in every 127 people is affected by Autism Spectrum Disorder (ASD) World Health Organization ([2021](https://arxiv.org/html/2605.11091#bib.bib1)), a lifelong neurodevelopmental condition characterised by persistent challenges in social communication, restricted and repetitive behaviours, and atypical sensory processing American Psychiatric Association ([2013](https://arxiv.org/html/2605.11091#bib.bib2)). The heterogeneity of symptom presentation is captured by the term *spectrum*, which spans mild social difficulties to severe communicative impairment. Over the past two decades, ASD prevalence has grown significantly Lundström and others ([2015](https://arxiv.org/html/2605.11091#bib.bib3)), placing pressure on healthcare systems already burdened by specialist shortages and high assessment costs Baird and others ([2006](https://arxiv.org/html/2605.11091#bib.bib4)). Since timely intervention substantially improves cognitive, social, and behavioural outcomes Howlin and others ([2004](https://arxiv.org/html/2605.11091#bib.bib5)), early identification is clinically critical. The Autism-Spectrum Quotient 10-item (AQ-10) instrument addresses this need by enabling rapid, low-cost first-pass assessment suitable for primary care.

Prior work on automated ASD screening spans three threads. Classical machine learning methods (Support Vector Machines, Random Forests, and $k$-Nearest Neighbours) applied to AQ-10 questionnaire data Thabtah ([2019](https://arxiv.org/html/2605.11091#bib.bib10), [2017](https://arxiv.org/html/2605.11091#bib.bib11)); Allison et al. ([2012](https://arxiv.org/html/2605.11091#bib.bib12)) remain competitive on tabular inputs, while deep learning extensions to neuroimaging Bayram and others ([2021](https://arxiv.org/html/2605.11091#bib.bib6)); Heinsfeld and others ([2017](https://arxiv.org/html/2605.11091#bib.bib7)); Eslami et al. ([2021](https://arxiv.org/html/2605.11091#bib.bib29)); Kong et al. ([2019](https://arxiv.org/html/2605.11091#bib.bib9)); Li et al. ([2022](https://arxiv.org/html/2605.11091#bib.bib30)), eye-tracking Fang et al. ([2020](https://arxiv.org/html/2605.11091#bib.bib8)), and video Tariq et al. ([2018](https://arxiv.org/html/2605.11091#bib.bib28)) achieve strong within-dataset accuracy but require specialised acquisition hardware unavailable in routine clinical settings.
In parallel, modern deep tabular architectures have emerged: TabNet Arık and Pfister ([2021](https://arxiv.org/html/2605.11091#bib.bib13)) with sequential attention and instance-wise feature masks, TabTransformer Huang and others ([2020](https://arxiv.org/html/2605.11091#bib.bib14)) and FT-Transformer Gorishniy and others ([2021](https://arxiv.org/html/2605.11091#bib.bib15)) applying self-attention over feature embeddings, and TabPFN v2 Hollmann and others ([2023](https://arxiv.org/html/2605.11091#bib.bib16)) performing in-context learning via a prior-data fitted network. Yet no ASD study has systematically compared them in a unified framework; the closest effort is Nithya and Sivasankaran ([2025](https://arxiv.org/html/2605.11091#bib.bib33)), who combined LIME interpretability with heuristic categorisation to propose educational plans for students but reported no calibration measures. Finally, clinical deployment demands evaluation beyond accuracy: calibration Guo and others ([2017](https://arxiv.org/html/2605.11091#bib.bib21)), robustness to distributional shift DeGrave and others ([2021](https://arxiv.org/html/2605.11091#bib.bib22)), and interpretability via SHAP Lundberg and Lee ([2017](https://arxiv.org/html/2605.11091#bib.bib17)) and LIME Ribeiro et al. ([2016](https://arxiv.org/html/2605.11091#bib.bib18)) are all essential, yet no existing ASD study formalises a composite metric that penalises misclassification asymmetrically and rewards stability across these axes.

These threads expose three concrete gaps. Most studies rely on *single-architecture evaluation*, comparing only one or two model families and omitting recent deep tabular learners on behavioural questionnaire data; they offer *axis-limited assessment*, reporting accuracy or F1 alone while neglecting calibration, interpretability, and robustness; and they typically focus on a *single age cohort*, almost exclusively adults or children, which masks age-specific diagnostic patterns. Adolescents in particular remain under-studied despite exhibiting distinct feature hierarchies and a harder classification task (F1 ceiling of 0.837 vs. 0.915 for children).

This study addresses these gaps through four contributions. First, we construct Dataset v3 by combining UCI AQ-10 data (v1) with a supplementary source (v2), yielding 4,068 records across three cohorts after deduplication and quality control. Second, we conduct a 17-model systematic benchmark encompassing classical ensembles, neural networks, and deep tabular transformers, each with baseline and hyperparameter-tuned variants, plus a foundation model. Third, we establish a four-axis evaluation framework covering predictive performance, calibration, interpretability, and adversarial robustness. Finally, we introduce the HAP (Heuristic Aggregate Penalty) metric: a clinically motivated composite score incorporating asymmetric FN/FP penalties and a cross-fold variance term.

## 2 Dataset and Preprocessing

Our v3 dataset (Table [1](https://arxiv.org/html/2605.11091#S2.T1)) integrates two sources. The primary source (v1) is the UCI Machine Learning Repository ASD Screening dataset Thabtah ([2017](https://arxiv.org/html/2605.11091#bib.bib11)), containing AQ-10 questionnaire responses (Table [2](https://arxiv.org/html/2605.11091#S2.T2)) and demographics for adult, adolescent, and child participants. The secondary source (v2) is an additional ASD screening dataset from the University of Arkansas, Department of Computer Science Grizan ([2024](https://arxiv.org/html/2605.11091#bib.bib34)), providing supplementary records with an identical AQ-10 structure. Although both data versions were collected through the ASDTest app, we identified distinct data points across sources and combined them through a two-stage preprocessing pipeline: the first stage performed deduplication by removing records shared between v1 and v2, while the second stage carried out data cleaning and quality control on a per-cohort basis (refer to the dataset link in the *Data and Code Availability* section). After processing, the final dataset contains 4,068 records across three age cohorts.

The merged v3 corpus is near\-balanced overall \(52\.5% ASD\-positive\), with a gender distribution of 67\.6% male and 32\.4% female\. Ethnicity composition is: White European 30\.4%, Asian 27\.8%, Middle Eastern 17\.7%, South Asian 9\.0%, and Black 4\.1%\. We note that fairness analysis of the ASD label across these sub\-categories is beyond the scope of the present study\.

Table 1: ASD-Bench v3 dataset composition and quality summary (post all cleaning steps).

| Cohort | Final Records | Age Range | Removed Duplicates (a) | ASD YES / NO (post-clean) |
|---|---|---|---|---|
| Child | 2,514 | 1–11 yr | 6 (2.1%) | ≈ 60% / 40% |
| Adolescent | 818 | 12–16 yr | 0 (0.0%) | ≈ 53% / 47% |
| Adult | 736 | 17–64 yr | 380 (54.0%) | ≈ 26% / 74% |
| Combined | 4,068 | 1–64 yr | 386 total | 52.5% / 47.5% |

Table 2: AQ-10 questionnaire items encoded as binary features A1–A10. ⋆ = reverse-scored. A threshold of ≥ 6 suggests elevated autistic traits.

| ID | AQ-10 Question | Scoring |
|---|---|---|
| A1 | I prefer to do things the same way over and over again. | 1 = Agree |
| A2 | I find it hard to make small talk. | 1 = Agree |
| A3 | I would rather go to a party than a museum. | 1 = Disagree ⋆ |
| A4 | I get highly upset if my routine is disrupted. | 1 = Agree |
| A5 | I notice patterns in things all the time. | 1 = Agree |
| A6 | I frequently don’t know how to keep a conversation going. | 1 = Agree |
| A7 | When reading a story, I find it hard to understand characters’ intentions. | 1 = Agree |
| A8 | I find it easy to work out what someone is thinking by looking at their face. | 1 = Disagree ⋆ |
| A9 | I enjoy social chit-chat. | 1 = Disagree ⋆ |
| A10 | I find it easy to understand what others are thinking. | 1 = Disagree ⋆ |

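The encoding in Table 2 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' preprocessing code: the function names are hypothetical, and it assumes each response has already been collapsed to a boolean agree/disagree value.

```python
# Items marked with a star in Table 2: "Disagree" scores 1.
REVERSE_SCORED = {"A3", "A8", "A9", "A10"}
ITEMS = [f"A{i}" for i in range(1, 11)]


def encode_item(item, agrees):
    """Binary feature for one item: 1 when the response counts towards the score."""
    if item in REVERSE_SCORED:
        return int(not agrees)
    return int(agrees)


def aq10_score(responses):
    """Sum the ten binary features; `responses` maps 'A1'..'A10' to True (agree) or False."""
    return sum(encode_item(item, responses[item]) for item in ITEMS)


def is_elevated(responses, threshold=6):
    """Apply the >= 6 threshold quoted in the Table 2 caption."""
    return aq10_score(responses) >= threshold


# A respondent who agrees with every statement scores on the six direct items only.
all_agree = {item: True for item in ITEMS}
print(aq10_score(all_agree), is_elevated(all_agree))
```

Agreeing with every item yields a score of 6, because the four reverse-scored items contribute 0 in that case.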
## 3 Methodology

### 3\.1 Models and Configurations

We evaluate four categories of models, each trained in both a default (baseline) and hyperparameter-tuned configuration via `GridSearchCV` unless stated otherwise.

Classical models. We include Logistic Regression Cox ([1958](https://arxiv.org/html/2605.11091#bib.bib32)), Random Forest Breiman ([2001](https://arxiv.org/html/2605.11091#bib.bib31)), AdaBoost Freund and Schapire ([1997](https://arxiv.org/html/2605.11091#bib.bib20)), and XGBoost Chen and Guestrin ([2016](https://arxiv.org/html/2605.11091#bib.bib19)) as well-established tabular baselines. A fully-connected Multi-Layer Perceptron (MLP) implemented in scikit-learn is included as a shallow neural baseline.

Deep tabular architectures. Three attention-based architectures are evaluated. TabNet Arık and Pfister ([2021](https://arxiv.org/html/2605.11091#bib.bib13)) uses sequential entmax masking (batch size 64, virtual batch 32, early-stopping patience 10). TabTransformer Huang and others ([2020](https://arxiv.org/html/2605.11091#bib.bib14)) applies column-wise self-attention with $d_{\mathrm{model}} = 32$, 3 layers, and 8 heads. FT-Transformer Gorishniy and others ([2021](https://arxiv.org/html/2605.11091#bib.bib15)) uses a CLS-token classification head with $d_{\mathrm{model}} = 32$. To quantify predictive uncertainty, Monte Carlo Dropout ($T = 20$ stochastic forward passes) is applied to all three transformer variants at inference time.

Foundation model. TabPFN v2 Hollmann and others ([2023](https://arxiv.org/html/2605.11091#bib.bib16)) is a prior-data fitted network pre-trained on synthetic task distributions that performs in-context learning at inference time: the target training set is passed as context without any gradient-based fine-tuning or hyperparameter sweep. We use 8 estimators with CPU inference. Because TabPFN receives no task-specific optimisation while all other models undergo full hyperparameter search, any direct ranking comparison is inherently asymmetric and favours the tuned models; TabPFN results should be read as a lower bound on foundation-model performance for this task.

### 3\.2 Training Protocol

All models were trained independently per cohort using a fixed random seed (42). A stratified 80/20 train–test split was applied globally before any training; the held-out 20% serves as the final evaluation set for all reported metrics, while 5-fold cross-validation is conducted on the training set. Deep models were optimised with Adam for up to 100 epochs, with early stopping (patience = 10) within each fold.

### 3\.3 Four\-Axis Evaluation Framework

#### 3\.3\.1 Axis 1 — Predictive Performance

Accuracy, precision, recall, F1\-score, and AUC\-ROC on the held\-out test set per cohort\. These standard classification metrics quantify discriminative performance per cohort\.

#### 3\.3\.2 Axis 2 — Calibration / Uncertainty Estimation

Expected Calibration Error (ECE), Brier score, mean confidence, confidence standard deviation, and mean prediction entropy. MC Dropout ($T = 20$) is applied to transformer variants for epistemic uncertainty estimation.

| Metric | Formula | Range | Preferred | Measures |
|---|---|---|---|---|
| Mean Confidence | $\dfrac{1}{n}\sum \max(p_i)$ | $[0, 1]$ | Higher | Average certainty |
| Std Confidence | $\sqrt{\mathrm{Var}(\max(p_i))}$ | $[0, 0.5]$ | Lower | Consistency |
| Entropy | $-\sum p \log(p)$ | $[0, 1]$ | Lower | Uncertainty |
| Brier Score | $\dfrac{1}{n}\sum (p_i - y_i)^2$ | $[0, 1]$ | Lower | Accuracy + Calibration |
| ECE | $\sum \dfrac{\lvert B_m \rvert}{n}\,\lvert \mathrm{acc} - \mathrm{conf} \rvert$ | $[0, 1]$ | Lower | Calibration quality |

Table 3: Calibration and uncertainty metrics used in the Axis 2 evaluation.

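The formulas in Table 3 can be illustrated with a short, dependency-free sketch for the binary case. Several details here are assumptions and not taken from the paper: the number of ECE bins (10), equal-width binning on confidence, and the use of base-2 logarithms so that binary entropy stays within the stated $[0, 1]$ range. The function names are hypothetical.

```python
import math


def mean_confidence(probs):
    """probs holds P(y=1) per sample; confidence is the larger class probability."""
    confs = [max(p, 1 - p) for p in probs]
    return sum(confs) / len(confs)


def brier_score(probs, labels):
    """Mean squared gap between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def mean_entropy(probs):
    """Average binary entropy in bits (assumed base 2, so each value lies in [0, 1])."""
    def h(p):
        return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)
    return sum(h(p) for p in probs) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = int(p >= 0.5)
        conf = max(p, 1 - p)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, int(pred == y)))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(a for _, a in members) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece


# Every prediction is correct, yet each is made with only 0.7 confidence.
probs = [0.7] * 5 + [0.3] * 5
labels = [1] * 5 + [0] * 5
print(expected_calibration_error(probs, labels))
```

In this toy case the classifier is perfectly accurate but under-confident, so the ECE comes out at about 0.3. This is the same kind of dissociation between accuracy and calibration that the benchmark reports for AdaBoost on adults.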
#### 3\.3\.3 Axis 3 — Interpretability

SHAP TreeExplainer \(XGBoost, RF\), SHAP DeepExplainer/GradientExplainer \(MLP\), permutation importance \(AdaBoost, TabTransformer, FT\-Transformer and TabPFN\), Logistic Regression coefficients, and TabNet built\-in feature masks\. A*consensus importance*is computed by averaging normalised scores across all 17 applicable models, with tool selection determined by model architecture \(gradient availability for neural networks, tree structure for ensembles\)\.

#### 3\.3\.4 Axis 4 — Robustness Testing

Three adversarial perturbation protocols on the test set:

1. Feature Flip: randomly flip $k \in \{10\%, 20\%, 30\%\}$ of binary feature values.
2. Gaussian Noise: add/subtract $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, $\sigma \in \{0.1, 0.2, 0.3\}$, clipped to $[0, 1]$.
3. Feature Removal: zero out the top-$k \in \{1, 2, 3\}$ most important features.

Composite robustness score: $R = 1 - \overline{\Delta\mathrm{acc}}$ across all perturbation levels. This axis assesses each model’s capacity to maintain predictions under noisy or erroneous input conditions (whether due to data entry errors or intentional misreporting). Additionally, given that ASD is better characterised as a construct space than a binary category, robustness testing reveals how models handle borderline cases through Gaussian noise degradation.
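A minimal sketch of the three perturbation protocols and the composite score follows. It is not the authors' implementation. It assumes that $\Delta\mathrm{acc}$ is the clean accuracy minus the perturbed accuracy, that the nine perturbation levels are weighted equally, and that a ranked list of feature indices (`importance_order`, a hypothetical argument) is supplied by the interpretability axis. `predict` stands for any callable that maps one feature row to a 0/1 label.

```python
import random


def feature_flip(rows, k, rng):
    """Flip a fraction k of all binary cells, chosen uniformly without replacement."""
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    flipped = [list(r) for r in rows]
    for i, j in rng.sample(cells, round(k * len(cells))):
        flipped[i][j] = 1 - flipped[i][j]
    return flipped


def gaussian_noise(rows, sigma, rng):
    """Add N(0, sigma^2) noise to every cell and clip the result to [0, 1]."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in r] for r in rows]


def feature_removal(rows, removed):
    """Zero out the columns whose indices are in `removed`."""
    return [[0 if j in removed else v for j, v in enumerate(r)] for r in rows]


def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def robustness_score(predict, rows, labels, importance_order, seed=42):
    """R = 1 - mean accuracy drop over the nine perturbation settings (assumed equal weights)."""
    rng = random.Random(seed)
    base = accuracy(predict, rows, labels)
    perturbed = (
        [feature_flip(rows, k, rng) for k in (0.1, 0.2, 0.3)]
        + [gaussian_noise(rows, s, rng) for s in (0.1, 0.2, 0.3)]
        + [feature_removal(rows, set(importance_order[:k])) for k in (1, 2, 3)]
    )
    drops = [base - accuracy(predict, p, labels) for p in perturbed]
    return 1.0 - sum(drops) / len(drops)
```

Under this definition a model whose accuracy improves under perturbation has a negative drop, which matches the sign convention used for the noise columns in Table 8.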

### 3\.4 HAP: Heuristic Aggregate Penalty

Standard accuracy and F1 metrics treat false positives (FP) and false negatives (FN) symmetrically. In clinical ASD screening, a false negative (missing a genuine case) deprives a child of early intervention, while a false positive leads only to an unnecessary follow-up referral. We formalise this asymmetry through the Heuristic Aggregate Penalty (HAP).

##### Cost function\.

Given a confusion matrix $\{\mathrm{TP}, \mathrm{TN}, \mathrm{FP}, \mathrm{FN}\}$ and penalty weights $w_{\mathrm{FP}} = 2$, $w_{\mathrm{FN}} = 10$ (with $w_{\mathrm{TP}} = w_{\mathrm{TN}} = 0$), the per-fold weighted cost is

$$\mathcal{C}_k = \frac{w_{\mathrm{FP}} \cdot \mathrm{FP}_k + w_{\mathrm{FN}} \cdot \mathrm{FN}_k}{N_k}, \tag{1}$$

where $N_k$ is the fold sample count. The ratio $w_{\mathrm{FN}} : w_{\mathrm{FP}} = 5:1$ is *conservative* relative to the cost effectiveness; a sensitivity analysis over the range $w_{\mathrm{FN}} : w_{\mathrm{FP}} \in [1, 20]$ confirms that model rankings are stable across this sweep ($W = 0.995$, $0.996$, and $1.000$ for the adolescent, child, and adult cohorts respectively; see Appendix), so conclusions are not driven by the specific choice.

##### Stability\-penalised aggregation\.

HAP aggregates $\mathcal{C}_k$ over stratified $K = 5$-fold cross-validation with an explicit variance penalty:

$$\mathrm{HAP} = \underbrace{\frac{1}{K}\sum_{k=1}^{K}\mathcal{C}_k}_{\text{mean cost}} + \lambda \cdot \underbrace{\mathrm{Var}_k\left(\mathcal{C}_k\right)}_{\text{instability}}, \tag{2}$$

rewarding models that are consistent across data partitions, a critical property for multi-site clinical deployment.
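Equations (1) and (2) translate directly into a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the fold counts are made up, and the use of population variance (not sample variance) across folds is an assumption, since the text does not specify the estimator.

```python
from statistics import mean, pvariance

W_FP, W_FN = 2, 10  # penalty weights from the cost function above


def fold_cost(fp, fn, n):
    """Per-fold weighted cost, Eq. (1): (w_FP * FP + w_FN * FN) / N."""
    return (W_FP * fp + W_FN * fn) / n


def hap(folds, lam=1.0):
    """Eq. (2): mean fold cost plus lam times the across-fold variance.

    `folds` is a list of (FP, FN, N) tuples, one per cross-validation fold.
    Population variance is an assumption of this sketch.
    """
    costs = [fold_cost(fp, fn, n) for fp, fn, n in folds]
    return mean(costs) + lam * pvariance(costs)


# Two made-up models with five errors each over five folds of 100 samples.
mostly_false_positives = [(1, 0, 100)] * 5
mostly_false_negatives = [(0, 1, 100)] * 5
print(hap(mostly_false_positives), hap(mostly_false_negatives))
```

Both hypothetical models make the same number of mistakes, but the one whose errors are false negatives receives a cost five times higher, reflecting the 5:1 weight ratio. Lower HAP is better.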

##### Principled selection of $\lambda$.

Rather than fixing $\lambda$ arbitrarily, we derive it from a discrimination signal-to-noise criterion. Each model $i$ traces a linear score trajectory $\mathrm{HAP}_i(\lambda) = \mu_i + \lambda\sigma_i^2$ in $\lambda$-space, where $\mu_i$ and $\sigma_i^2$ are its cross-validated mean cost and variance. Define the *inter-model separation* $S(\lambda) = \binom{N}{2}^{-1}\sum_{i<j}\lvert\mathrm{HAP}_i - \mathrm{HAP}_j\rvert$ and the discrimination SNR:

$$\mathrm{SNR}(\lambda) = \frac{S(\lambda)}{\sqrt{\mathrm{Var}_i\left[\mathrm{HAP}_i(\lambda)\right]}}. \tag{3}$$

Because $S(\lambda) \approx A + B\lambda$ (linear) while $\mathrm{Var}_i[\mathrm{HAP}_i(\lambda)] = C + D\lambda + E\lambda^2$ (quadratic, with $C = \mathrm{Var}(\mu)$, $D = 2\,\mathrm{Cov}(\mu, \sigma^2)$, $E = \mathrm{Var}(\sigma^2)$), $\mathrm{SNR}(\lambda)$ is unimodal and attains its maximum at:

$$\lambda^{*} = \frac{AE + \sqrt{A^2E^2 + BE(BC - AD)}}{BE}, \tag{4}$$

verified by $d^2\mathrm{SNR}/d\lambda^2\big|_{\lambda^{*}} < 0$. Beyond $\lambda^{*}$, pairwise rank crossovers $\lambda_{ij}^{\times} = (\mu_j - \mu_i)/(\sigma_i^2 - \sigma_j^2)$ begin to proliferate, causing rankings to reflect variance differences rather than mean performance: a dispersive regime analogous to gain-induced instability in linear control systems. $\lambda = 1.0$ lies within the stable discrimination lobe, achieving 98.6–99.4% of the theoretical maximum SNR while remaining an interpretable unit-consistent choice ($\mu$ and $\sigma^2$ share the same cost scale).
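As a numerical companion to Eq. (3), the sketch below evaluates the discrimination SNR on a grid of $\lambda$ values instead of using the closed form in Eq. (4), and computes the pairwise crossover point for two models. The per-model means and variances are made-up numbers for illustration, the function names are hypothetical, and the population standard deviation across models is an assumed estimator.

```python
from itertools import combinations
from statistics import pstdev


def hap_scores(mu, var, lam):
    """Linear score trajectory per model: mu_i + lam * sigma_i^2."""
    return [m + lam * v for m, v in zip(mu, var)]


def separation(scores):
    """Mean absolute pairwise difference between model scores, S(lambda)."""
    pairs = list(combinations(scores, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def snr(mu, var, lam):
    """Eq. (3): separation divided by the across-model standard deviation.

    Raises ZeroDivisionError if every model has the same score at this lam.
    """
    scores = hap_scores(mu, var, lam)
    return separation(scores) / pstdev(scores)


def crossover(mu_i, var_i, mu_j, var_j):
    """Lambda at which models i and j swap rank, or None for parallel trajectories."""
    if var_i == var_j:
        return None
    return (mu_j - mu_i) / (var_i - var_j)


# Hypothetical cross-validated mean costs and variances for three models.
mu = [0.10, 0.12, 0.20]
var = [0.020, 0.010, 0.005]
grid = [i / 10 for i in range(0, 51)]
best_lam = max(grid, key=lambda lam: snr(mu, var, lam))
print(best_lam, crossover(mu[0], var[0], mu[1], var[1]))
```

A grid search of this kind gives an independent check on any closed-form optimum, and listing the crossover points shows where in $\lambda$-space the ranking of a given pair of models would flip.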

## 4 Results

### 4\.1 Predictive Performance

Figures [1](https://arxiv.org/html/2605.11091#S4.F1)–[3](https://arxiv.org/html/2605.11091#S4.F3) present F1, AUC-ROC, and precision–recall distributions across all three cohorts.

![Refer to caption](https://arxiv.org/html/2605.11091v1/figures/fig1_f1_grouped.png)

Figure 1: F1 Score for all 17 models across three age cohorts (Adult, Child, Adolescent). Adults achieve perfect F1 for 10 of 17 models; child F1 peaks at 0.915, adolescent at 0.837.

##### Adults.

Ten of 17 models achieve F1 = 1\.000 and AUC = 1\.000, confirming near\-perfect separability of the adult AQ\-10 feature space\. XGBoost Baseline \(F1 = 0\.962\) and TabNet Baseline \(F1 = 0\.940\) are the only notable underperformers; the latter achieves recall = 1\.000 at precision = 0\.886\.

##### Adolescents\.

F1 ranges 0\.750–0\.837, a 7\.8 percentage\-point gap below the child ceiling\. TabPFN leads with F1 = 0\.837 and AUC = 0\.900\. TabTransformer Tuned \(the best\-performing child model\) achieves only F1 = 0\.750 and AUC = 0\.816 on adolescents, confirming that the adolescent cohort presents a harder classification task\. Random Forest Tuned achieves the second\-highest adolescent F1 \(0\.808\), suggesting tree\-ensemble stability is more valuable when the feature\-label mapping is weaker\.

![Refer to caption](https://arxiv.org/html/2605.11091v1/figures/fig2_auc_grouped.png)

Figure 2: AUC-ROC for all 17 models across three age cohorts. Adults: 11 of 17 models at AUC = 1.000. TabPFN achieves the highest child AUC (0.963) and adolescent AUC (0.900).

![Refer to caption](https://arxiv.org/html/2605.11091v1/figures/fig3_prec_recall_scatter.png)

Figure 3: Precision vs. Recall scatter per cohort. Diagonal dashed line: equal precision / recall. TabNet Baseline achieves recall = 1.000 on adults at the cost of precision = 0.886.

##### Children\.

F1 ranges 0.864–0.915. TabTransformer Tuned leads on F1 (0.915), while TabPFN achieves the highest AUC (0.963). Simpler models (AdaBoost Baseline, Logistic Regression) form the lower tier (F1 ≈ 0.867–0.870).

Table 4: F1-score and AUC-ROC for all 17 models across three cohorts. ⋆ = cohort best.

| Model | Ch. F1 | Ch. AUC | Ado. F1 | Ado. AUC | Ad. F1 | Ad. AUC |
|---|---|---|---|---|---|---|
| XGBoost Baseline | 0.910 | 0.958 | 0.792 | 0.857 | 0.962 | 0.999 |
| XGBoost Tuned | 0.912 | 0.958 | 0.795 | 0.880 | 1.000 | 1.000 |
| AdaBoost Baseline | 0.867 | 0.884 | 0.761 | 0.840 | 1.000 | 1.000 |
| AdaBoost Tuned | 0.909 | 0.956 | 0.782 | 0.844 | 1.000 | 1.000 |
| RF Baseline | 0.908 | 0.957 | 0.803 | 0.879 | 0.961 | 0.998 |
| RF Tuned | 0.909 | 0.954 | 0.808 | 0.886 | 0.961 | 0.997 |
| LR Baseline | 0.870 | 0.884 | 0.761 | 0.838 | 1.000 | 1.000 |
| LR Tuned | 0.874 | 0.884 | 0.798 | 0.848 | 1.000 | 1.000 |
| TabNet Baseline | 0.864 | 0.938 | 0.768 | 0.843 | 0.940 | 0.999 |
| TabNet Tuned | 0.899 | 0.951 | 0.779 | 0.833 | 0.962 | 0.999 |
| MLP Baseline | 0.901 | 0.949 | 0.797 | 0.876 | 0.961 | 0.999 |
| MLP Tuned | 0.897 | 0.939 | 0.808 | 0.848 | 0.974 | 1.000 |
| TabTransformer Base | 0.905 | 0.957 | 0.752 | 0.873 | 1.000 | 1.000 |
| TabTransformer Tuned | 0.915 ⋆ | 0.957 | 0.750 | 0.816 | 1.000 | 1.000 |
| FT-Transformer Base | 0.891 | 0.946 | 0.764 | 0.834 | 1.000 | 1.000 |
| FT-Transformer Tuned | 0.907 | 0.962 | 0.772 | 0.840 | 1.000 | 1.000 |
| TabPFN v2 | 0.911 | 0.963 ⋆ | 0.837 ⋆ | 0.900 ⋆ | 1.000 | 1.000 |

### 4\.2 Calibration Quality

![Refer to caption](https://arxiv.org/html/2605.11091v1/figures/fig5_ece_brier_scatter.png)

Figure 4: ECE vs. Brier Score per cohort (bottom-left = ideal). Adult on log-log scale; child/adolescent on linear scale. AdaBoost is a clear outlier in both panels.

Table 5: (5a) ECE for all models across age cohorts. ⋆ = cohort best; † = critical (ECE > 0.12).

| Model | ECE (Adult) | ECE (Child) | ECE (Adolescent) |
|---|---|---|---|
| XGBoost Baseline | 0.022 | 0.018 ⋆ | 0.078 |
| XGBoost Tuned | 0.014 | 0.022 | 0.106 |
| AdaBoost Baseline | 0.302 † | 0.190 † | 0.157 † |
| AdaBoost Tuned | 0.303 † | 0.260 † | 0.176 † |
| RF Baseline | 0.026 | 0.018 ⋆ | 0.067 ⋆ |
| RF Tuned | 0.046 | 0.037 | 0.093 |
| LR Baseline | 0.043 | 0.118 | 0.053 |
| LR Tuned | 0.003 | 0.124 † | 0.184 † |
| TabNet Baseline | 0.042 | 0.056 | 0.077 |
| TabNet Tuned | 0.036 | 0.041 | 0.104 |
| MLP Baseline | 0.041 | 0.034 | 0.076 |
| MLP Tuned | 0.038 | 0.044 | 0.148 † |
| TabTransformer Base | $8.3 \times 10^{-5}$ | 0.044 | 0.082 |
| TabTransformer Tuned | $2.1 \times 10^{-7}$ ⋆ | 0.022 | 0.103 |
| FT-Transformer Base | $1.5 \times 10^{-3}$ | 0.030 | 0.074 |
| FT-Transformer Tuned | $4.5 \times 10^{-3}$ | 0.027 | 0.071 |
| TabPFN v2 | $3.3 \times 10^{-3}$ | 0.024 | 0.071 |

The threshold of ECE > 0.12 was chosen empirically, lying approximately one standard deviation above the mean of the observed ECE distribution and coinciding with a natural gap between the well-calibrated cluster (ECE ≤ 0.106) and poorly calibrated models (ECE ≥ 0.118).

Table 6: (5b) Brier Score and Mean Confidence for calibration-safe models only (no ECE > 0.12 in any cohort from Table [5](https://arxiv.org/html/2605.11091#S4.T5); 4 models excluded: AdaBoost Baseline/Tuned, LR Tuned, MLP Tuned; LR Baseline borderline at 0.118 and excluded as a precaution). Brier Score lower is better; Mean Confidence closer to 1.0 indicates decisive predictions. ⋆ = cohort best Brier.

| Model | Brier Adult | Brier Child | Brier Adolescent | Conf. Adult | Conf. Child | Conf. Adolescent |
|---|---|---|---|---|---|---|
| XGBoost Baseline | 0.014 | 0.075 | 0.128 | 0.965 | 0.843 | 0.782 |
| XGBoost Tuned | 0.000 ⋆ | 0.074 | 0.131 | 1.000 | 0.848 | 0.794 |
| RF Baseline | 0.026 | 0.076 | 0.124 | 0.948 | 0.841 | 0.800 |
| RF Tuned | 0.026 | 0.077 | 0.118 ⋆ | 0.947 | 0.839 | 0.806 |
| TabNet Baseline | 0.042 | 0.092 | 0.133 | 0.939 | 0.809 | 0.779 |
| TabNet Tuned | 0.027 | 0.082 | 0.139 | 0.960 | 0.820 | 0.768 |
| MLP Baseline | 0.028 | 0.082 | 0.126 | 0.951 | 0.820 | 0.793 |
| TabTransformer Base | 0.000 ⋆ | 0.080 | 0.137 | 1.000 | 0.827 | 0.773 |
| TabTransformer Tuned | 0.000 ⋆ | 0.076 | 0.140 | 1.000 | 0.840 | 0.768 |
| FT-Transformer Base | 0.000 ⋆ | 0.078 | 0.136 | 1.000 | 0.833 | 0.774 |
| FT-Transformer Tuned | 0.000 ⋆ | 0.077 | 0.133 | 1.000 | 0.836 | 0.779 |
| TabPFN v2 | 0.000 ⋆ | 0.071 ⋆ | 0.113 ⋆ | 1.000 | 0.854 | 0.820 |

A critical finding is that high predictive accuracy does not imply good calibration: AdaBoost achieves F1 = 1.000 on adults with ECE = 0.302, rendering its probability outputs untrustworthy for clinical threshold-based decisions, while the adult cohort overall achieves the strongest calibration, with five models attaining ECE < 0.005 and TabTransformer Tuned reaching $2.1 \times 10^{-7}$.
The adolescent cohort exhibits systematically weaker calibration than children: the best calibration-safe adolescent ECE (RF Baseline: 0.067) is nearly 4× worse than the best child ECE (XGBoost Baseline: 0.018), and LR Tuned degrades from ECE = 0.053 to 0.184 after tuning, suggesting overfitting to adolescent noise.
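The calibration-safe filter described above can be reproduced from the reported ECE values. The sketch below is illustrative, with a hypothetical function name, and applies the strict ECE > 0.12 rule to the numbers from the ECE table. It does not encode the additional precautionary exclusion of LR Baseline, which sits just under the threshold at 0.118.

```python
ECE_CRITICAL = 0.12

# (adult, child, adolescent) ECE per model, copied from the ECE table.
ECE = {
    "XGBoost Baseline": (0.022, 0.018, 0.078),
    "XGBoost Tuned": (0.014, 0.022, 0.106),
    "AdaBoost Baseline": (0.302, 0.190, 0.157),
    "AdaBoost Tuned": (0.303, 0.260, 0.176),
    "RF Baseline": (0.026, 0.018, 0.067),
    "RF Tuned": (0.046, 0.037, 0.093),
    "LR Baseline": (0.043, 0.118, 0.053),
    "LR Tuned": (0.003, 0.124, 0.184),
    "TabNet Baseline": (0.042, 0.056, 0.077),
    "TabNet Tuned": (0.036, 0.041, 0.104),
    "MLP Baseline": (0.041, 0.034, 0.076),
    "MLP Tuned": (0.038, 0.044, 0.148),
    "TabTransformer Base": (8.3e-5, 0.044, 0.082),
    "TabTransformer Tuned": (2.1e-7, 0.022, 0.103),
    "FT-Transformer Base": (1.5e-3, 0.030, 0.074),
    "FT-Transformer Tuned": (4.5e-3, 0.027, 0.071),
    "TabPFN v2": (3.3e-3, 0.024, 0.071),
}


def calibration_flagged(ece_table, threshold=ECE_CRITICAL):
    """Models whose ECE exceeds the threshold in at least one cohort."""
    return [name for name, values in ece_table.items()
            if any(v > threshold for v in values)]


print(calibration_flagged(ECE))
```

The strict rule flags AdaBoost Baseline, AdaBoost Tuned, LR Tuned, and MLP Tuned, matching the four exclusions listed in the Brier table caption.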

### 4\.3 Interpretability and Feature Attribution

![Refer to caption](https://arxiv.org/html/2605.11091v1/figures/fig6_feature_importance.png)

Figure 5: Consensus feature importance (averaged across 17 applicable models, normalised) per cohort. ★ = top-ranked feature per cohort. Note distinct hierarchies: A9 dominates children; A5 leads adolescents; adults show a flat multi-feature profile.

![Refer to caption](https://arxiv.org/html/2605.11091v1/figures/fig13_fi_heatmap.png)

Figure 6: Feature importance heatmap: cohort × AQ-10 feature. Bold values = cohort maximum. The three cohorts show distinct importance profiles.

Table 7: Consensus feature importance by cohort (⋆ = cohort maximum).

| Top Feat. | Adult | Child | Adolescent | Clinical observation based on Thabtah ([2017](https://arxiv.org/html/2605.11091#bib.bib11)) |
|---|---|---|---|---|
| A9 | 0.719 | 0.240 ⋆ | 0.058 | Social chit-chat: dominant child; drops in adolescents |
| A5 | 0.722 ⋆ | 0.111 | 0.138 ⋆ | Pattern recognition: top adult & adolescent feature |
| A4 | 0.708 | 0.180 | 0.083 | Routine disruption: strong child; weaker in adolescents |
| A7 | 0.666 | 0.154 | 0.118 | Theory-of-mind (reading intentions) |
| A8 | 0.652 | 0.151 | 0.092 | Face-reading |
| A3 | 0.618 | 0.037 | 0.026 | Lowest importance across all cohorts |

##### Adults.

Flat importance profile (A5: 0.722 ≈ A9: 0.719 ≈ A1: 0.716). No single feature dominates; 8 different features are ranked top across the 17 models, reflecting multi-faceted adult ASD diagnostic patterns.

##### Children\.

A9 \(*“I enjoy social chit\-chat”*, reverse\-scored\) strongly dominates \(consensus importance 0\.240\), ranked top by 11 of 17 models\. A4 \(0\.180\) and A7 \(0\.154\) are secondary\. Social motivation is the primary early\-childhood ASD marker\.

##### Adolescents\.

A complete feature hierarchy shift occurs: A5 (*“I notice patterns in things all the time”*) leads (0.138), followed by A7 (0.118) and A10 (0.098). A9 drops to 8th place (0.058), a stark contrast to children. A5 wins the top-feature vote in 5 of 17 models; A6, A7, and A10 each win 3, and the remaining 3 models select different top features, reflecting a fragmented signal across all 17 models. This shift from social (A9) to cognitive-perceptual (A5/A7) features is consistent with adolescent social masking, which we treat as a plausible hypothesis rather than a direct psychological measurement.

### 4\.4 Robustness

![Refer to caption](https://arxiv.org/html/2605.11091v1/figures/fig8_robustness_bars.png)

Figure 7: Robustness scores ranked by cohort\. Green: ≥ 0\.88 \(high\); yellow: 0\.82–0\.88; red: < 0\.82\. Dashed line: score = 0\.90\.

![Refer to caption](https://arxiv.org/html/2605.11091v1/figures/fig9_noise_degradation.png)

Figure 8: Average accuracy drop under Gaussian noise injection\. Negative values indicate noise\-immune models \(slight regularisation benefit\)\. Transformer models degrade by about 24%; TabPFN shows a larger drop\.

![Refer to caption](https://arxiv.org/html/2605.11091v1/figures/fig10_f1_vs_robustness.png)

Figure 9: F1 Score vs\. Robustness tradeoff per cohort\. Dashed lines: F1 = 0\.95 and Robustness = 0\.90\. Ideal models appear top\-right\. The accuracy\-robustness tradeoff is evident across all three cohorts\.

Table 8: Robustness scores and Gaussian noise degradation \(Δacc\) across all three cohorts\. ⋆ = cohort best robustness; † = R < 0\.82 \(critical risk\)\. Negative noise Δ indicates noise\-immune behaviour\. Nine mid\-tier models omitted; full results in supplementary material\.

| Model | Rob\. Ad\. | Rob\. Ch\. | Rob\. Ado\. | Noise Δ Ad\. | Noise Δ Ch\. | Noise Δ Ado\. |
| --- | --- | --- | --- | --- | --- | --- |
| **Panel A: Robustness\-Viable Models \(strong robustness; noise sensitivity\)** | | | | | | |
| MLP Baseline | 0\.933⋆ | 0\.893 | 0\.948 | −0\.007 | \+0\.008 | \+0\.003 |
| TabNet Baseline | 0\.924 | 0\.951 | 0\.980⋆ | −0\.010 | −0\.002 | \+0\.003 |
| RF Baseline | 0\.915 | 0\.887 | 0\.940 | \+0\.000 | \+0\.006 | \+0\.000 |
| RF Tuned | 0\.922 | 0\.900 | 0\.955 | \+0\.000 | \+0\.004 | \+0\.000 |
| TabTransformer Tuned | 0\.840 | 0\.771 | 0\.916 | \+0\.243 | \+0\.332 | \+0\.116 |
| XGBoost Tuned | 0\.837 | 0\.815 | 0\.924 | \+0\.243 | \+0\.344 | \+0\.152 |
| TabPFN v2 | 0\.833 | 0\.763† | 0\.815 | \+0\.264 | \+0\.382 | \+0\.341 |
| **Panel B: Not Viable \(robustness deficits compound weaknesses across other axes\)** | | | | | | |
| AdaBoost Baseline | 0\.920 | 0\.954 | 0\.940 | \+0\.003 | \+0\.000 | \+0\.000 |
| FT\-Transformer Tuned | 0\.840 | 0\.805 | 0\.847 | \+0\.243 | \+0\.314 | \+0\.232 |

The thresholds \(≥ 0\.88 green, 0\.82–0\.88 yellow, < 0\.82 red\) are empirically grounded in the bimodal distribution of noise degradation: models scoring ≥ 0\.88 exhibit average noise degradation below 0\.02, while those below 0\.82 exceed 0\.20\. This order\-of\-magnitude separation reflects the Lipschitz\-continuity gap between piecewise\-constant tree ensembles and deep transformers, whose compositional non\-linearities amplify perturbations multiplicatively with depth \(Szegedy et al\., [2014](https://arxiv.org/html/2605.11091#bib.bib35); Goodfellow et al\., [2015](https://arxiv.org/html/2605.11091#bib.bib36)\), with the intermediate band capturing mixed degradation profiles warranting case\-by\-case review\.
The 0\.90 dashed line marks the overfitting–robustness boundary made visible by the adult cohort \($n = 148$\), where models achieving perfect baseline accuracy \(1\.000\) still fall into the yellow/red zone, indicating brittle decision boundaries rather than genuine generalisation \(Zhang et al\., [2017](https://arxiv.org/html/2605.11091#bib.bib37); Grinsztajn et al\., [2022](https://arxiv.org/html/2605.11091#bib.bib38)\)\. Table [8](https://arxiv.org/html/2605.11091#S4.T8) partitions models via cross\-axis viability rather than robustness alone, since single\-metric ranking obscures the compensatory structure governing clinical deployability\. Panel A contains robustness\-viable models whose noise sensitivity is externally manageable, evaluated *jointly* with F1 and ECE so that moderate robustness with strong calibration remains deployable \(e\.g\., XGBoost Tuned: adult $R = 0.837$, with child\-cohort ECE $= 0.022$ and F1 $= 0.912$\)\. Panel B excludes models on compound\-axis grounds: AdaBoost Baseline is disqualified despite competitive robustness \(0\.920/0\.954/0\.940 for adults/children/adolescents\) by critical miscalibration in every cohort \(ECE $= 0.302/0.190/0.157$, exceeding the 0\.05 clinical threshold \(Van Calster et al\., [2019](https://arxiv.org/html/2605.11091#bib.bib39)\)\), and FT\-Transformer Tuned is excluded as Pareto\-dominated by Panel A alternatives with no compensating advantage on any remaining axis\. Nine mid\-tier models are omitted for brevity; full scores appear in the supplementary material\.
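The colour bands and the noise\-degradation measurement described above can be illustrated with a minimal sketch. It assumes a generic `predict` callable and a noise scale `sigma`; neither the noise scale nor the exact robustness\-score formula is stated in this section, so `toy_predict`, `sigma`, and the toy rows are hypothetical stand\-ins rather than the benchmark's actual protocol.

```python
import random

def robustness_tier(score):
    """Map a robustness score to the colour bands of Figure 7 / Table 8."""
    if score >= 0.88:
        return "high"          # green
    if score >= 0.82:
        return "intermediate"  # yellow
    return "critical"          # red

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(x) == y for x, y in zip(rows, labels))
    return hits / len(labels)

def noise_accuracy_drop(predict, rows, labels, sigma=0.5, seed=0):
    """Clean accuracy minus accuracy after adding N(0, sigma^2) noise to every feature.

    sigma is an assumed value; a negative result means noise did not hurt.
    """
    rng = random.Random(seed)
    noisy = [[v + rng.gauss(0.0, sigma) for v in x] for x in rows]
    return accuracy(predict, rows, labels) - accuracy(predict, noisy, labels)

# Hypothetical stand-in for a trained screener: flag when the item sum exceeds 6.
def toy_predict(x):
    return int(sum(x) > 6)

rows = [[1] * 10, [0] * 10, [1] * 7 + [0] * 3, [1] * 3 + [0] * 7]
labels = [1, 0, 1, 0]

print(noise_accuracy_drop(toy_predict, rows, labels, sigma=1.0))
print([robustness_tier(r) for r in (0.933, 0.837, 0.763)])
```

Applied to the adult column of Table 8, the tier function places MLP Baseline in the high band and XGBoost Tuned in the intermediate band, while TabPFN v2's child score of 0\.763 falls in the critical band.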

Transformer\-family noise brittleness \(∼24% degradation under Gaussian perturbation\) can be mitigated with established clinical AI hardening techniques, although some of them \(adversarial training, deep ensembles\) require additional training rather than a drop\-in wrapper\. Adversarial training and certified smoothing \(Cohen et al\., [2019](https://arxiv.org/html/2605.11091#bib.bib23)\) improve effective robustness, while Monte Carlo Dropout and deep ensembles \(Lakshminarayanan et al\., [2017](https://arxiv.org/html/2605.11091#bib.bib24)\) enable uncertainty\-aware inference with automatic flagging of low\-confidence cases for human\-in\-the\-loop review, consistent with FDA SaMD guidance \(U\.S\. Food and Drug Administration, [2021](https://arxiv.org/html/2605.11091#bib.bib25)\)\. Conformal risk control \(Angelopoulos and Bates, [2023](https://arxiv.org/html/2605.11091#bib.bib26)\) further provides distribution\-free coverage guarantees valuable for the small\-sample adolescent cohort\. Population\-level drift can be detected through AI observatory tools that trigger recalibration when incoming distributions diverge from training, a realistic concern in multi\-site clinical rollout\.
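To make the distribution\-free guarantee concrete, the following is a minimal sketch of split conformal prediction for a binary screener. This is a simpler relative of the conformal risk control cited above, not that method itself, and the calibration probabilities, labels, and the miscoverage level `alpha` are illustrative assumptions.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.25):
    """Calibrate a threshold on nonconformity scores s = 1 - p(true label).

    cal_probs holds predicted P(y = 1) on a held-out calibration set.
    """
    scores = sorted(p if y == 0 else 1.0 - p for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    if rank > n:
        return 1.0  # too few calibration points: keep every label
    return scores[rank - 1]

def prediction_set(p_pos, q_hat):
    """Labels whose nonconformity score does not exceed the threshold."""
    labels = set()
    if 1.0 - p_pos <= q_hat:  # score of label 1
        labels.add(1)
    if p_pos <= q_hat:        # score of label 0
        labels.add(0)
    return labels

# Hypothetical calibration data (not taken from the benchmark).
cal_probs = [0.9, 0.2, 0.75, 0.3, 0.6, 0.55, 0.3]
cal_labels = [1, 0, 1, 0, 1, 0, 1]
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

for p in (0.95, 0.5, 0.1):
    print(p, prediction_set(p, q_hat))
```

A prediction set containing both labels, or neither, signals a case the model cannot resolve at the chosen coverage level, which is a natural trigger for the human\-in\-the\-loop review described above.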

### 4\.5 HAP Metric Results

![Refer to caption](https://arxiv.org/html/2605.11091v1/figures/fig11_hap_bars.png)

Figure 10: HAP metric rankings per cohort \(lower = better\), computed via 5\-fold stratified cross\-validation with $w_{\mathrm{FN}} = 10$, $w_{\mathrm{FP}} = 2$, $\lambda = 1.0$\. Colours indicate performance tiers: green \(lowest HAP\), yellow \(middle\), red \(highest HAP\)\.

##### Adults\.

Twelve of 17 models achieve HAP = 0\.000, meaning zero weighted misclassifications across all five cross\-validation folds, consistent with the near\-perfect F1 and AUC results on this cohort\. The five non\-zero models are ordered: MLP Tuned \(0\.018\), RF Tuned \(0\.022\), MLP Baseline \(0\.036\), TabNet Tuned \(0\.059\), and TabNet Baseline \(0\.075\)\. These are driven primarily by residual false negatives in harder folds rather than false positives, given the $w_{\mathrm{FN}} : w_{\mathrm{FP}} = 5:1$ penalty structure\.

##### Children\.

HAP rankings diverge meaningfully from F1 rankings\. TabPFN leads \(0\.803\) despite ranking only 7th on F1 \(0\.911\), reflecting its superior probability calibration minimising costly false negatives\. TabTransformer Baseline scores worst \(1\.057\); its moderate ECE \(0\.044\) compounds FN errors into elevated cross\-fold variance\. AdaBoost Baseline ranks 14th on HAP \(0\.958\) despite competitive F1, penalised for its severe miscalibration \(ECE = 0\.190\) inflating the variance term $\lambda \cdot \mathrm{Var}$\.

##### Adolescents\.

HAP scores are roughly 2\.5× higher than for children, directly reflecting the lower absolute F1 \(more FNs per fold\)\. RF Baseline, XGBoost Baseline, and TabPFN tie for best \(2\.073\), the same three models that lead on raw AUC\. TabTransformer Baseline scores worst \(2\.540\), consistent with its poor adolescent F1 \(0\.752\) and moderate ECE \(0\.082\)\. The HAP ranking here aligns closely with F1 rankings, suggesting that at lower performance levels, false\-negative volume dominates over variance as the primary cost driver\.

### 4\.6 Four\-Axis Summary Scorecard

![Refer to caption](https://arxiv.org/html/2605.11091v1/figures/fig14_scorecard_heatmap.png)

Figure 11: Four\-axis normalised scorecard for eight key models \(A = Adult, C = Child, Ado = Adolescent\)\. All axes normalised to $[0, 1]$; higher = better\. Calibration plotted as $1 - \mathrm{ECE}$\.

The radar profiles below highlight the recommended models per cohort\.

![Refer to caption](https://arxiv.org/html/2605.11091v1/figures/fig7_radar.png)

Figure 12: Four\-axis radar profiles for the three recommended models: TabTransformer Tuned \(adult\), XGBoost Tuned \(child\), TabPFN v2 \(adolescent\)\. Axes: F1, AUC, Calibration \($1 - \mathrm{ECE}$\), Robustness, feature clarity\.

## 5 Discussion

Building on the quantitative results presented above, we now analyse the qualitative factors driving model performance and their clinical implications\.

### 5\.1 Child–Adolescent Performance Gap

Children and adolescents are genuinely distinct cohorts with different classification difficulty and feature hierarchies\. The adolescent F1 ceiling is 0\.837 \(TabPFN\) vs 0\.915 \(TabTransformer Tuned\) for children, a 7\.8 percentage\-point gap that persists across all 17 models\. Adolescent AUC similarly peaks at 0\.900 vs 0\.963 for children\. Beyond performance, the feature hierarchy shifts completely: A9 \(social motivation\) dominates for children while A5 \(pattern recognition\) leads for adolescents, with A9 dropping to 8th place\. One plausible clinical interpretation is social masking: adolescents learn to compensate for surface social behaviours while perceptual rigidities persist\. This remains a hypothesis, since the benchmark measures questionnaire responses rather than masking itself\. These findings argue strongly for cohort\-specific models in clinical ASD screening rather than age\-agnostic deployment\. A larger, clinically verified adolescent\-specific dataset is needed to maximise predictive performance\.

### 5\.2 Accuracy–Robustness–Calibration Tradeoffs

A recurring pattern is strong dissociation across evaluation axes\. Transformer architectures achieve perfect F1 on adults but rank among the least robust models \($R$ = 0\.833–0\.840\), with around 24% noise\-induced accuracy degradation\. AdaBoost achieves F1 = 1\.000 with ECE = 0\.302\. TabPFN achieves the best AUC with the worst robustness, a deficit that external hardening methods can partly offset\. Adolescents exhibit an unexpected pattern: most models are *more robust* than on children \(TabNet Baseline: 0\.980 ado vs 0\.951 child; even TabPFN, last at 0\.815, outperforms most child\-cohort scores\) despite lower absolute F1, suggesting a smoother decision boundary under perturbation\. The pattern is not universal: in Table [8](https://arxiv.org/html/2605.11091#S4.T8), AdaBoost Baseline is slightly less robust on adolescents \(0\.940\) than on children \(0\.954\)\. For less robust models, uncertainty\-quantification techniques such as conformal prediction do not remove the brittleness itself but can limit its practical impact by flagging unreliable predictions\.
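The dissociation across axes can be checked mechanically with a Pareto\-dominance test. The sketch below uses the per\-cohort values reported in Table 9, with calibration expressed as $1 - \mathrm{ECE}$ so that higher is better on every axis; the helper names `dominates` and `pareto_front` are illustrative and not part of the benchmark code.

```python
def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(models):
    """Names of models that no other model in the dict dominates."""
    return [
        name for name, score in models.items()
        if not any(dominates(other, score)
                   for other_name, other in models.items() if other_name != name)
    ]

# (F1, AUC, 1 - ECE, robustness) taken from Table 9.
cohorts = {
    "adult": {
        "TabTransformer Tuned": (1.000, 1.000, 1 - 2.1e-7, 0.840),
        "MLP Baseline": (0.961, 0.999, 1 - 0.041, 0.933),
    },
    "child": {
        "XGBoost Tuned": (0.912, 0.958, 1 - 0.022, 0.815),
        "RF Tuned": (0.909, 0.954, 1 - 0.037, 0.900),
    },
    "adolescent": {
        "TabPFN v2": (0.837, 0.900, 1 - 0.071, 0.815),
        "TabNet Baseline": (0.768, 0.843, 1 - 0.077, 0.980),
    },
}

for cohort, models in cohorts.items():
    print(cohort, pareto_front(models))
```

In every cohort both recommended models remain on the front: the more accurate model gives up robustness and the more robust model gives up accuracy, which is why the recommendations are split by deployment setting.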

Table 9: Deployment recommendations by cohort and setting, with supporting metrics drawn from Tables [4](https://arxiv.org/html/2605.11091#S4.T4), [5](https://arxiv.org/html/2605.11091#S4.T5), [6](https://arxiv.org/html/2605.11091#S4.T6), and [8](https://arxiv.org/html/2605.11091#S4.T8)\. † Pair with uncertainty flagging for $p \in [0.40, 0.60]$\.

| Cohort | Setting | Recommended Model | F1 | AUC | ECE | Robust\. |
| --- | --- | --- | --- | --- | --- | --- |
| Adult | Controlled | TabTransformer Tuned | 1\.000 | 1\.000 | $2.1 \times 10^{-7}$ | 0\.840 |
| Adult | Noisy | MLP Baseline | 0\.961 | 0\.999 | 0\.041 | 0\.933 |
| Child | General | XGBoost Tuned | 0\.912 | 0\.958 | 0\.022 | 0\.815 |
| Child | Noisy | RF Tuned | 0\.909 | 0\.954 | 0\.037 | 0\.900 |
| Adolescent | General | TabPFN v2† | 0\.837 | 0\.900 | 0\.071 | 0\.815 |
| Adolescent | Noisy | TabNet Baseline | 0\.768 | 0\.843 | 0\.077 | 0\.980 |

### 5\.3 AdaBoost Miscalibration

AdaBoost’s critical miscalibration \(ECE ≈ 0\.157–0\.302 across cohorts, per the values reported in Section 4\.4\) despite competitive F1 is well\-explained by boosting’s confidence amplification\. Any clinical deployment must apply post\-hoc calibration before probabilities are used\. We recommend treating AdaBoost F1 results as upper bounds on operational performance\.
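For readers unfamiliar with ECE, the following minimal sketch computes a binned expected calibration error for a binary classifier. It assumes equal\-width bins over the predicted positive\-class probability; the exact binning scheme used in the benchmark is not stated in this section, and the example probabilities are invented to show an overconfident model.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted P(y = 1) and observed
    positive rate, over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(p for p, _ in items) / len(items)
        pos_rate = sum(y for _, y in items) / len(items)
        ece += len(items) / n * abs(mean_conf - pos_rate)
    return ece

# Hypothetical overconfident model: predicts 0.9 everywhere, is right 60% of the time.
probs = [0.9] * 10
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
ece = expected_calibration_error(probs, labels)

CLINICAL_ECE_THRESHOLD = 0.05  # threshold quoted in Section 4.4
print(round(ece, 3), "needs recalibration:", ece > CLINICAL_ECE_THRESHOLD)
```

A classifier can rank cases perfectly, and therefore score F1 = 1\.000 at some threshold, while still reporting probabilities that are far from the observed frequencies; this is the pattern the AdaBoost results exhibit.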

### 5\.4 TabPFN as a Clinical Foundation Model

The comparison between TabPFN v2 and fully tuned models is deliberately asymmetric: TabPFN uses a fixed pre\-trained network with in\-context learning \(no gradient updates or hyperparameter search on the target data\) while all other models receive extensive optimisation\. Despite this disadvantage, TabPFN achieves the highest child AUC \(0\.963\) and competitive child F1 \(0\.911\)\. Its calibration is consistently strong \(ECE ≈ 0\.003–0\.071 across cohorts\)\. A fine\-tuned TabPFN may close or eliminate the remaining F1 gap; this comparison is planned as future work\. The critical limitation is robustness \(ranked last in all cohorts\)\. We recommend pairing TabPFN with uncertainty monitoring and external robustness\-enhancement techniques: predictions with $p \in [0.40, 0.60]$ should be automatically routed to human clinical review\.
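The routing rule above is simple enough to state in code. This is a minimal sketch: the band limits come from the text, while the function name, the outcome labels, and the treatment of probabilities outside the band are assumptions made for illustration.

```python
def route_prediction(p_pos, low=0.40, high=0.60):
    """Route a predicted ASD-positive probability.

    Probabilities inside [low, high] go to human clinical review;
    the labels returned here are hypothetical.
    """
    if low <= p_pos <= high:
        return "human_review"
    return "screen_positive" if p_pos > high else "screen_negative"

for p in (0.05, 0.40, 0.55, 0.61, 0.97):
    print(p, route_prediction(p))
```

Both band limits are inclusive in this sketch, so a prediction of exactly 0\.40 or 0\.60 is still sent for review.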

### 5\.5 HAP in Clinical Context

The penalty ratio $w_{\mathrm{FN}} : w_{\mathrm{FP}} = 5:1$ reflects a straightforward clinical reality: missing a genuine ASD case costs far more than triggering an unnecessary follow\-up referral\. Deployment teams can adjust this ratio to fit their setting \(higher $w_{\mathrm{FP}}$ in resource\-constrained environments, higher $w_{\mathrm{FN}}$ in universal screening programmes\) without changing which models come out ahead\.

The variance penalty $\lambda = 1.0$ serves an equally practical purpose: it down\-ranks models that are accurate on average but inconsistent across sites or patient subgroups, since a model that behaves erratically across data partitions is a deployment risk regardless of its headline score\.
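The interplay of the two terms can be illustrated with a minimal sketch. The exact HAP formula is defined earlier in the paper and is not reproduced in this section, so the form used here, a mean per\-fold weighted error rate plus $\lambda$ times its cross\-fold variance, is an assumption chosen to match the description; `hap_score` and the fold counts are hypothetical.

```python
from statistics import mean, pvariance

def hap_score(fold_counts, w_fn=10.0, w_fp=2.0, lam=1.0):
    """Assumed HAP form: mean weighted error rate across folds plus
    lam times the variance of that rate across folds (lower = better).

    fold_counts is a list of (false_negatives, false_positives, n_samples).
    """
    costs = [(w_fn * fn + w_fp * fp) / n for fn, fp, n in fold_counts]
    return mean(costs) + lam * pvariance(costs)

# Two hypothetical models with the same total errors over five folds of 30 cases.
stable = [(1, 1, 30)] * 5                     # one FN and one FP in every fold
erratic = [(5, 5, 30)] + [(0, 0, 30)] * 4     # all errors concentrated in one fold

print("stable :", round(hap_score(stable), 3))
print("erratic:", round(hap_score(erratic), 3))
print("perfect:", hap_score([(0, 0, 30)] * 5))
```

Under this assumed form the two models share the same mean cost, and the erratic one is ranked worse purely through the variance term, which is the behaviour the paragraph above describes.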

Beyond evaluation, HAP admits a useful geometric interpretation: it maps the misclassification space $(\mathrm{FN}, \mathrm{FP})$, which is defined independently of any particular model's weights and hyperparameters, into a single discriminative score, tracing a trajectory through model space as $\lambda$ varies\. This structure makes HAP a natural candidate for an additional inverse reward signal in reinforcement learning settings, where the negative HAP score can guide a policy toward the optimal operating point in the $(\mathrm{FN}, \mathrm{FP})$ plane rather than optimising a flat accuracy surface\. The $\lambda$\-parameterised trajectory then acts as a curriculum, progressively penalising instability as training matures\.

In short, HAP captures two properties that standard metrics miss: what *kind* of error a model makes, and how *consistently* it behaves\. Its strong agreement with the 4\-axis analysis recommendations suggests it reflects genuine clinical utility rather than a statistical artefact\. We recommend reporting HAP alongside F1 and AUC in future clinical AI benchmarks\.

## 6 Limitations

### 6\.1 Dataset and Cohort Constraints

The v3 dataset relies entirely on self\-reported AQ\-10 responses collected through a single application \(ASDTest\), raising concerns about response bias and limited generalisability to broader clinical populations\. The adult cross\-dataset duplication rate \(54\.0%\) undermines the independence of the two combined sources: after deduplication only 148 unique adult records remain, yielding a test set too small to reliably distinguish genuine model capability from dataset simplicity or residual distributional overlap\. The near\-perfect adult scores \(10/17 models at F1 = 1\.000\) should therefore be interpreted with caution: they likely reflect the constrained sample size and limited feature diversity rather than true diagnostic task saturation\. The adolescent cohort \(818 records\) remains small; we acknowledge that a larger, clinically verified adolescent\-specific dataset is needed to meaningfully raise the F1 ceiling beyond 0\.837\. Furthermore, the exclusion of demographic variables \(gender, ethnicity, country of residence\) and the absence of a stratified fairness analysis limit the clinical completeness of the evaluation\. Although the dataset contains meaningful demographic diversity \(67\.6% male; five ethnic groups represented\), the per\-subgroup sample sizes are too small to support statistically meaningful stratified performance comparisons \(e\.g\., the Black subgroup comprises only 4\.1% of records\)\. A properly powered fairness evaluation would require targeted oversampling or a larger, demographically balanced collection effort\.

### 6\.2 Evaluation and Metric Limitations

A fundamental limitation is the absence of multi\-site or external validation: all models are trained and evaluated on data drawn from a single\-source distribution \(the ASDTest application\)\. No independent clinical site, alternative screening instrument, or geographically distinct population was used to verify that the reported performance transfers beyond the originating data collection context\. Consequently, all results should be treated as internal benchmark scores rather than estimates of real\-world diagnostic performance\.

Additionally, all predictive performance metrics \(F1, AUC\) are reported on a single stratified train–test split without repeated random seeds or statistical significance testing; observed differences between models may therefore reflect split\-specific variation rather than true performance gaps\. The HAP metric, while clinically motivated, employs penalty weights that are acknowledged to be statistically approximated rather than empirically elicited from domain experts, limiting its authority as a standardised clinical measure\. Finally, the comparison between fully hyperparameter\-tuned models and TabPFN v2 \(which performs in\-context learning on a fixed pre\-trained network without any gradient updates or hyperparameter search on the target data\) introduces an inherent asymmetry in the evaluation design that should be interpreted with caution\.

### 6\.3 Clinical Validity

The AQ\-10 is a screening instrument rather than a diagnostic tool, and all benchmark evaluations validate model predictions against questionnaire scores rather than formal clinician\-confirmed ASD diagnoses\. Consequently, even a model achieving perfect F1 on the held\-out test set provides no guarantee of real diagnostic utility in a clinical pathway\. Until prospective, multi\-site validation against clinician\-confirmed diagnoses is conducted across diverse healthcare settings, the findings presented in this work should be interpreted as proof\-of\-concept evidence rather than a demonstration of clinical deployability\.

## 7 Conclusion

We presented a four\-axis benchmark of 17 ML/DL/foundation models for ASD questionnaire screening across child, adolescent, and adult cohorts\. All results are derived from a single\-source distribution with no multi\-site or external validation; they represent internal benchmark performance and should not be interpreted as clinical generalisability estimates\.

##### Core findings\.

The adult cohort achieves near\-perfect held\-out scores \(10 of 17 models at F1 = 1\.000\), though this likely reflects the small sample size \($n = 148$ after deduplication\) and limited data diversity rather than genuine diagnostic task saturation\. Cohort\-specific feature hierarchies revealed developmental shifts in ASD phenotypic expression\. A9 \(social motivation\) emerged as the dominant feature for children \(importance 0\.240, ranking top in 11 of 17 models\), while A5 \(pattern recognition\) led for adolescents \(0\.138, with a fragmented signal spread across A5, A6, A7, and A10\)\. Adults displayed a flat multi\-feature profile with no single dominant predictor\.

##### Adolescent cohort difficulty\.

The adolescent cohort is a harder classification task, with an F1 ceiling of 0\.837 compared to 0\.915 for children\. However, adolescents show higher robustness \(TabNet Baseline: 0\.980\), suggesting a smoother decision boundary under perturbation\. These results indicate that age\-agnostic models are inadvisable, but they remain benchmark\-level evidence rather than proof of clinical diagnostic readiness\.

##### HAP contribution\.

The proposed HAP framework \(with tunable parameters\) offers a principled clinical penalty structure that complements standard metrics, with potential application for model optimisation through reinforcement learning\.

##### Deployment recommendations\.

Model recommendations vary by cohort and deployment context: TabTransformer Tuned for adults in controlled settings, MLP Baseline for adults in noisy environments, XGBoost Tuned for children, TabPFN v2 for adolescents where accuracy is the priority, and TabNet Baseline for adolescents where robustness is prioritised\. These deployment recommendations should still be interpreted cautiously because no stratified fairness analysis was run during model selection\.

##### Future work\.

Future directions include prospective multi\-site validation on independent clinical populations, a stratified fairness analysis across gender and ethnicity subgroups \(requiring a larger, demographically balanced dataset\), age\-adaptive ensemble systems, and formal HAP weight elicitation from domain experts\. On the modelling side, we plan a comparative study of tabular foundation models including a fine\-tuned TabPFN across three cohorts, working toward a medical foundation model for ASD and related neurodevelopmental conditions\. We further aim to apply symbolic regression and mechanistic interpretability to extract clinically transparent ASD concepts from learned representations\. Finally, the hard asymmetric HAP penalty will be replaced by a confidence\-weighted variant that focuses learning on high\-confidence false negatives rather than penalising all missed positives uniformly\.

## Data and Code Availability

## Ethics Statement

This study uses publicly available, de\-identified questionnaire data from the UCI Machine Learning Repository and a supplementary source\. No new human subjects data were collected\. All analyses were conducted on pre\-existing, anonymised records\. The AQ\-10 is a screening instrument, not a diagnostic tool; model outputs should not be used as a substitute for clinical assessment\.

## References

- Toward brief “red flags” for autism screening: the short autism spectrum quotient and the short quantitative checklist for autism in toddlers in 1,000 cases and 3,000 controls\. Journal of the American Academy of Child & Adolescent Psychiatry 51\(2\), pp\. 202–212\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- American Psychiatric Association \(2013\) Diagnostic and statistical manual of mental disorders, 5th edition\. American Psychiatric Publishing, Washington, DC\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p1.1)\.
- A\. N\. Angelopoulos and S\. Bates \(2023\) Conformal risk control\. In International Conference on Learning Representations\. Cited by: [§4\.4](https://arxiv.org/html/2605.11091#S4.SS4.p2.1)\.
- S\. Ö\. Arık and T\. Pfister \(2021\) TabNet: attentive interpretable tabular learning\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 35, pp\. 6679–6687\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1), [§3\.1](https://arxiv.org/html/2605.11091#S3.SS1.p3.3)\.
- G\. Baird et al\. \(2006\) Prevalence of disorders of the autism spectrum in a population cohort of children in south thames\. The Lancet 368\(9531\), pp\. 210–215\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p1.1)\.
- S\. Bayram et al\. \(2021\) Deep learning methods for autism spectrum disorder diagnosis based on fmri images\. Sakarya University Journal of Computer and Information Sciences 4, pp\. 142–155\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- L\. Breiman \(2001\) Random forests\. Machine Learning 45\(1\), pp\. 5–32\. Cited by: [§3\.1](https://arxiv.org/html/2605.11091#S3.SS1.p2.1)\.
- T\. Chen and C\. Guestrin \(2016\) XGBoost: a scalable tree boosting system\. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp\. 785–794\. Cited by: [§3\.1](https://arxiv.org/html/2605.11091#S3.SS1.p2.1)\.
- J\. M\. Cohen, E\. Rosenfeld, and J\. Z\. Kolter \(2019\) Certified adversarial robustness via randomised smoothing\. In International Conference on Machine Learning, pp\. 1310–1320\. Cited by: [§4\.4](https://arxiv.org/html/2605.11091#S4.SS4.p2.1)\.
- D\. R\. Cox \(1958\) The regression analysis of binary sequences\. Journal of the Royal Statistical Society: Series B \(Methodological\) 20\(2\), pp\. 215–232\. Cited by: [§3\.1](https://arxiv.org/html/2605.11091#S3.SS1.p2.1)\.
- A\. J\. DeGrave et al\. \(2021\) AI for radiographic COVID\-19 detection selects shortcuts over signal\. Nature Machine Intelligence 3, pp\. 610–619\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- T\. Eslami, F\. Almuqhim, J\. S\. Raiker, and F\. Saeed \(2021\) Machine learning methods for diagnosing autism spectrum disorder and attention\-deficit/hyperactivity disorder using functional and structural MRI: a survey\. Frontiers in Neuroinformatics 14, pp\. 575999\. External Links: [Document](https://dx.doi.org/10.3389/fninf.2020.575999)\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- Y\. Fang, H\. Duan, F\. Shi, X\. Min, and G\. Zhai \(2020\) Identifying children with autism spectrum disorder based on gaze\-following\. In 2020 IEEE International Conference on Image Processing \(ICIP\), pp\. 423–427\. External Links: [Document](https://dx.doi.org/10.1109/ICIP40778.2020.9190831)\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- Y\. Freund and R\. E\. Schapire \(1997\) A decision\-theoretic generalization of on\-line learning and an application to boosting\. Journal of Computer and System Sciences 55\(1\), pp\. 119–139\. Cited by: [§3\.1](https://arxiv.org/html/2605.11091#S3.SS1.p2.1)\.
- I\. J\. Goodfellow, J\. Shlens, and C\. Szegedy \(2015\) Explaining and harnessing adversarial examples\. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA\. External Links: 1412\.6572, [Link](https://arxiv.org/abs/1412.6572)\. Cited by: [§4\.4](https://arxiv.org/html/2605.11091#S4.SS4.p1.8)\.
- Y\. Gorishniy et al\. \(2021\) Revisiting deep learning models for tabular data\. In Advances in Neural Information Processing Systems, Vol\. 34, pp\. 18932–18943\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1), [§3\.1](https://arxiv.org/html/2605.11091#S3.SS1.p3.3)\.
- L\. Grinsztajn, E\. Oyallon, and G\. Varoquaux \(2022\) Why do tree\-based models still outperform deep learning on typical tabular data?\. In Advances in Neural Information Processing Systems 35 \(NeurIPS 2022\) Datasets and Benchmarks Track, New Orleans, LA, USA\. External Links: 2207\.08815, [Link](https://arxiv.org/abs/2207.08815)\. Cited by: [§4\.4](https://arxiv.org/html/2605.11091#S4.SS4.p1.8)\.
- A\. Grizan \(2024\) ASD questionnaires – final\. Kaggle\. External Links: [Link](https://www.kaggle.com/datasets/afarinbargrizan/asd-final/data)\. Cited by: [§2](https://arxiv.org/html/2605.11091#S2.p1.1)\.
- C\. Guo et al\. \(2017\) On calibration of modern neural networks\. In International Conference on Machine Learning, pp\. 1321–1330\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- A\. S\. Heinsfeld et al\. \(2017\) Identification of autism spectrum disorder using deep learning and the ABIDE dataset\. NeuroImage: Clinical 17, pp\. 16–23\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- N\. Hollmann et al\. \(2023\) TabPFN: a transformer that solves small tabular classification problems in a second\. In International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1), [§3\.1](https://arxiv.org/html/2605.11091#S3.SS1.p4.1)\.
- Howlin et al\. \(2004\) Adult outcome for children with autism\. Journal of Child Psychology and Psychiatry 45\(2\), pp\. 212–229\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p1.1)\.
- X\. Huang et al\. \(2020\) TabTransformer: tabular data modeling using contextual embeddings\. arXiv preprint arXiv:2012\.06678\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1), [§3\.1](https://arxiv.org/html/2605.11091#S3.SS1.p3.3)\.
- Y\. Kong, J\. Gao, Y\. Xu, Y\. Pan, J\. Wang, and J\. Liu \(2019\) Classification of autism spectrum disorder by combining brain connectivity and deep neural network classifier\. Neurocomputing 324, pp\. 63–68\. Note: Deep Learning for Biological/Clinical Data\. External Links: ISSN 0925\-2312, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.neucom.2018.04.080), [Link](https://www.sciencedirect.com/science/article/pii/S0925231218306234)\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- B\. Lakshminarayanan, A\. Pritzel, and C\. Blundell \(2017\) Simple and scalable predictive uncertainty estimation using deep ensembles\. In Advances in Neural Information Processing Systems, Vol\. 30\. Cited by: [§4\.4](https://arxiv.org/html/2605.11091#S4.SS4.p2.1)\.
- M\. X\. Li, S\. Tu, S\. u\. Rehman, and Y\. Yang \(2022\) An improved dynamic functional connectivity and deep neural network model for autism spectrum disorder classification\. In Proceedings of the 6th International Conference on Deep Learning Technologies \(ICDLT\), pp\. 37–41\. External Links: [Document](https://dx.doi.org/10.1145/3556677.3556694)\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- S\. M\. Lundberg and S\. Lee \(2017\) A unified approach to interpreting model predictions\. In Advances in Neural Information Processing Systems, Vol\. 30\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- S\. Lundström et al\. \(2015\) Autism phenotypes and educational performance\. Journal of Child Psychology and Psychiatry 56\(5\), pp\. 567–576\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p1.1)\.
- A\. Nithya and V\. Sivasankaran \(2025\) Autism spectrum disorder–level prediction and personalized education planning using TabNet\. Autism, pp\. 1–11\. External Links: [Document](https://dx.doi.org/10.1177/13623613251375199), [Link](https://doi.org/10.1177/13623613251375199)\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- M\. T\. Ribeiro, S\. Singh, and C\. Guestrin \(2016\) “Why should I trust you?”: explaining the predictions of any classifier\. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp\. 1135–1144\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- C\. Szegedy, W\. Zaremba, I\. Sutskever, J\. Bruna, D\. Erhan, I\. J\. Goodfellow, and R\. Fergus \(2014\) Intriguing properties of neural networks\. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada\. External Links: 1312\.6199, [Link](https://arxiv.org/abs/1312.6199)\. Cited by: [§4\.4](https://arxiv.org/html/2605.11091#S4.SS4.p1.8)\.
- Q\. Tariq, J\. Daniels, J\. N\. Schwartz, P\. Washington, H\. Kalantarian, and D\. P\. Wall \(2018\) Mobile detection of autism through machine learning on home video: a development and prospective validation study\. PLOS Medicine 15\(11\), pp\. e1002705\. External Links: [Document](https://dx.doi.org/10.1371/journal.pmed.1002705)\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- F\. Thabtah \(2017\) Autism spectrum disorder screening dataset\. Note: UCI Machine Learning Repository\. External Links: [Link](https://archive.ics.uci.edu/ml/datasets/Autism+Screening+Adult)\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1), [§2](https://arxiv.org/html/2605.11091#S2.p1.1), [Table 7](https://arxiv.org/html/2605.11091#S4.T7.5.4.5.1.1.1)\.
- F\. Thabtah \(2019\) Machine learning in autistic spectrum disorder behavioural research: a review\. Informatics for Health and Social Care 44\(3\), pp\. 278–297\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p2.1)\.
- U\.S\. Food and Drug Administration \(2021\) Artificial intelligence/machine learning \(AI/ML\)\-based software as a medical device \(SaMD\) action plan\. Note: [https://www\.fda\.gov/media/145022/download](https://www.fda.gov/media/145022/download)\. Cited by: [§4\.4](https://arxiv.org/html/2605.11091#S4.SS4.p2.1)\.
- B\. Van Calster, D\. J\. McLernon, M\. van Smeden, L\. Wynants, and E\. W\. Steyerberg \(2019\) Calibration: the Achilles heel of predictive analytics\. BMC Medicine 17\(1\), pp\. 230\. External Links: [Document](https://dx.doi.org/10.1186/s12916-019-1466-7), [Link](https://bmcmedicine.biomedcentral.com/articles/10.1186/s12916-019-1466-7)\. Cited by: [§4\.4](https://arxiv.org/html/2605.11091#S4.SS4.p1.8)\.
- World Health Organization \(2021\) Autism spectrum disorders\. Note: [https://www\.who\.int/news\-room/fact\-sheets/detail/autism\-spectrum\-disorders](https://www.who.int/news-room/fact-sheets/detail/autism-spectrum-disorders)\. Cited by: [§1](https://arxiv.org/html/2605.11091#S1.p1.1)\.
- C\. Zhang, S\. Bengio, M\. Hardt, B\. Recht, and O\. Vinyals \(2017\) Understanding deep learning requires rethinking generalization\. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France\. External Links: 1611\.03530, [Link](https://arxiv.org/abs/1611.03530)\. Cited by: [§4\.4](https://arxiv.org/html/2605.11091#S4.SS4.p1.8)\.
