Conformal Risk Prediction for Non-Alcoholic Fatty Liver Disease Using Gradient Boosting with Distribution-Free Coverages
Summary
This paper presents LiverRisk, a machine learning framework for NAFLD risk prediction that combines gradient-boosted decision trees with conformal prediction to provide calibrated, distribution-free coverage guarantees on individual risk estimates, achieving high AUROC on internal and external cohorts.
# Conformal Risk Prediction for Non-Alcoholic Fatty Liver Disease Using Gradient Boosting with Distribution-Free Coverages
Source: [https://arxiv.org/html/2606.09860](https://arxiv.org/html/2606.09860)
Xinze Zhang, University of Southern California, Los Angeles, CA 90007, USA. Corresponding author: zhangxinze00@outlook.com
###### Abstract
Non-alcoholic fatty liver disease (NAFLD) affects roughly a quarter of the global adult population and carries substantial long-term hepatic and cardiovascular risks, yet population-level screening tools remain inadequate for early identification of at-risk individuals. We present LiverRisk, a machine-learning framework for NAFLD risk prediction that couples gradient-boosted decision trees with conformal prediction to yield calibrated, distribution-free coverage guarantees on individual risk estimates. The framework integrates a mutual-information-based stability selection procedure that identifies a compact, clinically interpretable feature subset through bootstrap resampling, and constructs conformalized prediction sets whose marginal coverage is provably at least a user-specified confidence level under the sole assumption of exchangeability. We evaluate LiverRisk on a multicenter health examination cohort from Guangzhou, China (primary cohort $n = 2{,}187$; external validation $n = 412$), drawing on 78 candidate features spanning demographics, anthropometrics, metabolic biomarkers, liver enzymes, lipid panels, lifestyle factors, and hematological indices. LiverRisk attains an area under the receiver operating characteristic curve (AUROC) of 0.912 on the internal test set and 0.891 on the external cohort, outperforming deep neural networks, TabNet, support vector machines, and logistic regression. Conformal prediction sets achieve empirical coverage of 91.3% at the nominal 90% level. A three-tier risk stratification derived from the conformalized scores separates the population into clinically distinct groups, with the high-risk subgroup exhibiting a 12-month progression rate 4.7 times that of the low-risk tier.
The selected feature set, dominated by waist circumference, alanine aminotransferase, gamma-glutamyl transferase, triglycerides, fasting glucose, and body mass index, is consistent with established metabolic risk factors, lending biological plausibility to the model's decisions.
Keywords: NAFLD, gradient boosting, conformal prediction, distribution-free inference, feature selection, clinical risk stratification
## 1 Introduction
Non-alcoholic fatty liver disease encompasses a spectrum of hepatic conditions ranging from simple steatosis to non-alcoholic steatohepatitis (NASH), fibrosis, and cirrhosis, and is now recognized as the most common chronic liver disease worldwide \[[18](https://arxiv.org/html/2606.09860#bib.bib44),[11](https://arxiv.org/html/2606.09860#bib.bib45),[10](https://arxiv.org/html/2606.09860#bib.bib46),[47](https://arxiv.org/html/2606.09860#bib.bib58),[28](https://arxiv.org/html/2606.09860#bib.bib53)\]. Prevalence estimates vary by geography and diagnostic criteria, but meta-analyses consistently place the global figure near 25%, with substantially higher rates in populations characterized by metabolic syndrome, obesity, or type-2 diabetes \[[27](https://arxiv.org/html/2606.09860#bib.bib47),[22](https://arxiv.org/html/2606.09860#bib.bib48),[25](https://arxiv.org/html/2606.09860#bib.bib49),[12](https://arxiv.org/html/2606.09860#bib.bib54)\]. The clinical burden is compounded by the fact that NAFLD frequently progresses silently: many patients remain undiagnosed until advanced fibrosis or hepatocellular carcinoma emerges, at which point therapeutic options are limited \[[24](https://arxiv.org/html/2606.09860#bib.bib50),[23](https://arxiv.org/html/2606.09860#bib.bib51),[46](https://arxiv.org/html/2606.09860#bib.bib61),[49](https://arxiv.org/html/2606.09860#bib.bib62),[30](https://arxiv.org/html/2606.09860#bib.bib55)\]. Early risk identification therefore has the potential to redirect clinical resources toward lifestyle intervention and pharmacological management during a window where disease regression is achievable.
Population-level screening for NAFLD has traditionally relied on liver ultrasonography or elevated serum aminotransferase levels, both of which suffer from well-documented limitations. Ultrasound sensitivity drops below 65% for mild steatosis \[[26](https://arxiv.org/html/2606.09860#bib.bib52),[48](https://arxiv.org/html/2606.09860#bib.bib59),[37](https://arxiv.org/html/2606.09860#bib.bib60)\], while alanine aminotransferase (ALT) alone misses a substantial fraction of biopsy-confirmed NAFLD cases \[[33](https://arxiv.org/html/2606.09860#bib.bib6),[15](https://arxiv.org/html/2606.09860#bib.bib56),[29](https://arxiv.org/html/2606.09860#bib.bib57)\]. Composite clinical scores such as the Fatty Liver Index (FLI) \[[5](https://arxiv.org/html/2606.09860#bib.bib7)\] and the Hepatic Steatosis Index (HSI) \[[21](https://arxiv.org/html/2606.09860#bib.bib8)\] improve upon single-marker approaches but are constrained by the linear or log-linear functional forms that underlie their construction. Machine-learning methods can, in principle, capture the nonlinear, high-order interactions among metabolic, anthropometric, and lifestyle variables that characterize NAFLD pathophysiology, and a growing body of work has applied random forests \[[32](https://arxiv.org/html/2606.09860#bib.bib9)\], gradient-boosted trees \[[44](https://arxiv.org/html/2606.09860#bib.bib10)\], and deep neural networks \[[42](https://arxiv.org/html/2606.09860#bib.bib11)\] to NAFLD classification from electronic health records.
A persistent criticism of machine-learning risk models in clinical settings is the absence of rigorous uncertainty quantification. A point prediction, however accurate, gives the clinician no principled way to distinguish a patient whose predicted risk of 0.72 is tightly concentrated from one whose 0.72 is highly uncertain. Conformal prediction \[[43](https://arxiv.org/html/2606.09860#bib.bib17),[38](https://arxiv.org/html/2606.09860#bib.bib18)\] offers an elegant resolution: given any base predictor and an exchangeable calibration sample, it constructs prediction sets that contain the true outcome with at least a user-specified probability, without distributional assumptions on the data-generating process. Recent work has begun to explore conformal methods in medical imaging \[[31](https://arxiv.org/html/2606.09860#bib.bib21)\] and survival analysis \[[7](https://arxiv.org/html/2606.09860#bib.bib22)\], but their application to tabular clinical risk prediction, and specifically to NAFLD screening, remains largely unexplored.
A second challenge concerns feature selection. Health examination datasets routinely include dozens to hundreds of laboratory and questionnaire variables, many of which are redundant or irrelevant to a specific disease endpoint. Selecting a parsimonious feature set is desirable for clinical deployment (where cost and patient burden matter), for model interpretability, and for generalization to new populations where certain assays may be unavailable. Stability selection \[[34](https://arxiv.org/html/2606.09860#bib.bib25)\] addresses the well-known instability of single-run feature selection by aggregating results over bootstrap resamples, yielding per-feature selection probabilities with finite-sample error control. We extend this idea by replacing the base selector with a mutual-information estimator \[[20](https://arxiv.org/html/2606.09860#bib.bib27)\] that captures nonlinear dependencies, matching the capacity of the downstream gradient-boosted model.
This paper makes three contributions. First, we develop LiverRisk, a LightGBM-based prediction framework wrapped with split conformal prediction to produce risk estimates accompanied by distribution-free prediction sets, and we prove that the marginal coverage guarantee holds under exchangeability of the calibration and test data. Second, we propose a mutual-information stability selection procedure that identifies a compact feature set whose selection frequency exceeds a theoretically motivated threshold, providing per-family error rate control on the number of uninformative features admitted. Third, we validate the complete pipeline on a multicenter cohort from Guangzhou, China, demonstrating strong discrimination (AUROC 0.912 internal, 0.891 external), well-calibrated conformal sets, and clinically meaningful three-tier risk stratification. Figure [1](https://arxiv.org/html/2606.09860#S1.F1) provides a schematic overview of the clinical motivation and the gap that LiverRisk addresses.
Figure 1: Schematic illustration of the clinical gap in NAFLD screening. Existing composite scores and single biomarkers (left) leave a substantial fraction of at-risk patients undetected. LiverRisk (right) combines gradient boosting with conformal prediction to provide individualized risk estimates with coverage guarantees, enabling a principled three-tier stratification for clinical decision-making.
## 2 Related Work
### 2.1 Machine Learning for NAFLD Prediction
The application of machine learning to NAFLD prediction has accelerated over the past decade, driven by the growing availability of electronic health record (EHR) data and the recognition that linear models fail to capture the complex metabolic interactions underlying hepatic steatosis. Ma et al. \[[32](https://arxiv.org/html/2606.09860#bib.bib9)\] trained random forests on a Chinese health examination cohort and reported AUROCs around 0.84, with BMI, triglycerides, and ALT ranking among the top predictors. Xia et al. \[[44](https://arxiv.org/html/2606.09860#bib.bib10)\] compared XGBoost \[[9](https://arxiv.org/html/2606.09860#bib.bib12)\] against logistic regression on a Korean population cohort and found a 5–7 percentage-point AUROC advantage for gradient boosting. Sowa et al. \[[42](https://arxiv.org/html/2606.09860#bib.bib11)\] applied deep neural networks to a German tertiary-care dataset, achieving an AUROC of 0.87 but noting difficulties with model calibration and interpretability. Ensemble-based approaches using LightGBM \[[19](https://arxiv.org/html/2606.09860#bib.bib13)\] have gained traction owing to their computational efficiency and native handling of missing values, which are common in routine health examinations. TabNet \[[3](https://arxiv.org/html/2606.09860#bib.bib14)\], an attention-based architecture designed for tabular data, has also been evaluated in hepatological contexts \[[41](https://arxiv.org/html/2606.09860#bib.bib15)\] but tends to require larger sample sizes to outperform well-tuned tree ensembles. Across this literature, two gaps persist: the lack of distribution-free uncertainty quantification for individual risk scores, and the absence of theoretically grounded feature selection that accounts for the instability of variable importance rankings across training runs.
### 2.2 Conformal Prediction in Clinical Settings
Conformal prediction, originally formulated by Vovk et al. \[[43](https://arxiv.org/html/2606.09860#bib.bib17)\], provides a model-agnostic framework for constructing prediction sets with finite-sample coverage guarantees. The split (or inductive) variant \[[35](https://arxiv.org/html/2606.09860#bib.bib19)\] partitions the available data into a training fold and a calibration fold, fits an arbitrary base learner on the training fold, and uses the calibration residuals (or nonconformity scores) to set thresholds that define prediction sets at a target miscoverage rate $\alpha$. The only assumption required for the marginal coverage guarantee is exchangeability of calibration and test points, which is strictly weaker than the i.i.d. assumption and permits certain forms of distributional shift \[[4](https://arxiv.org/html/2606.09860#bib.bib20)\]. In clinical applications, Lu et al. \[[31](https://arxiv.org/html/2606.09860#bib.bib21)\] applied conformal prediction to dermatological image classification and demonstrated that coverage could be maintained across demographic subgroups through group-conditional calibration. Candès et al. \[[7](https://arxiv.org/html/2606.09860#bib.bib22)\] extended conformal ideas to survival analysis, producing conformalized survival curves that cover the true event time with a specified probability. Angelopoulos and Bates \[[1](https://arxiv.org/html/2606.09860#bib.bib23)\] provided an accessible tutorial that has catalyzed adoption in applied settings. Despite this momentum, conformal prediction has seen limited use in tabular clinical risk models, where the combination of mixed feature types, moderate sample sizes, and the need for clinically actionable outputs presents distinct challenges that our work addresses.
### 2.3 Feature Selection with Stability Guarantees
Feature selection for clinical prediction models must balance predictive performance with interpretability, cost, and robustness to data perturbations. Classical filter methods rank features by univariate association measures, such as the $\chi^{2}$ statistic, mutual information \[[13](https://arxiv.org/html/2606.09860#bib.bib24)\], or correlation, and select the top-$k$, but they ignore feature interactions and are sensitive to the particular training sample drawn. Wrapper methods, including recursive feature elimination \[[17](https://arxiv.org/html/2606.09860#bib.bib28)\], iteratively retrain the model and are computationally expensive. Embedded methods, such as $\ell_{1}$-penalized regression or tree-based importance \[[6](https://arxiv.org/html/2606.09860#bib.bib30)\], are efficient but produce unstable rankings when features are correlated, as is typical of metabolic biomarker panels. Stability selection \[[34](https://arxiv.org/html/2606.09860#bib.bib25)\] addresses this fragility by repeating the selection step on random subsamples of the data and retaining only those features whose selection frequency exceeds a threshold $\pi_{\text{thr}}$. Meinshausen and Bühlmann showed that the expected number of falsely selected variables (per-family error rate) can be bounded as a function of $\pi_{\text{thr}}$ and the expected number of selected features per subsample. Shah and Samworth \[[39](https://arxiv.org/html/2606.09860#bib.bib26)\] refined these bounds and introduced complementary pairs stability selection. In our work, we replace the base selector with a $k$-nearest-neighbor mutual information estimator \[[20](https://arxiv.org/html/2606.09860#bib.bib27)\], which captures nonlinear dependencies without parametric distributional assumptions, and we integrate the resulting feature set into the LightGBM training loop.
## 3 Methodology
### 3.1 Problem Setup
Let $\{(x_i, y_i)\}_{i=1}^{n}$ denote an exchangeable sequence of feature–label pairs, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^{d}$ comprises $d$ clinical features and $y_i \in \{0,1\}$ indicates NAFLD status ($y_i = 1$ for NAFLD-positive). We seek a scoring function $f: \mathcal{X} \to [0,1]$ whose output $\hat{p}(x) = f(x)$ estimates the conditional probability $\Pr(Y = 1 \mid X = x)$, together with a prediction-set function $\mathcal{C}_{\alpha}: \mathcal{X} \to 2^{\{0,1\}}$ that satisfies the marginal coverage property

$$\Pr\bigl(Y_{n+1} \in \mathcal{C}_{\alpha}(X_{n+1})\bigr) \;\geq\; 1 - \alpha \tag{1}$$

for a user-specified miscoverage level $\alpha \in (0,1)$, under the sole assumption that $(X_1, Y_1), \ldots, (X_{n+1}, Y_{n+1})$ are exchangeable.

We also seek a feature subset $S \subseteq \{1, \ldots, d\}$ with $|S| \ll d$ such that the restricted model $f_S$, operating on $x_S = (x_j)_{j \in S}$, retains competitive discrimination while admitting interpretable clinical explanations and controlling the per-family error rate on spurious selections.
### 3.2 Gradient-Boosted Decision Trees
LiverRisk uses LightGBM \[[19](https://arxiv.org/html/2606.09860#bib.bib13)\] as its base learner. LightGBM constructs an additive ensemble of $T$ regression trees,

$$F(x) = \sum_{t=1}^{T} \eta\, h_t(x), \tag{2}$$

where $h_t$ is the $t$-th tree and $\eta \in (0,1]$ is the learning rate (shrinkage parameter). Each tree $h_t$ is grown by leaf-wise splitting, choosing the split that maximizes the reduction in a second-order approximation of the training loss \[[9](https://arxiv.org/html/2606.09860#bib.bib12)\]. For binary classification we use the logistic loss,

$$\mathcal{L}(y, F(x)) = -\bigl[\, y \log \sigma(F(x)) + (1 - y) \log\bigl(1 - \sigma(F(x))\bigr) \bigr], \tag{3}$$

where $\sigma(\cdot)$ is the sigmoid function. The predicted probability is $\hat{p}(x) = \sigma\bigl(F(x)\bigr)$. LightGBM introduces two key algorithmic innovations, gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB), that accelerate training on large, sparse feature matrices with little loss of accuracy. Regularization is controlled through the maximum tree depth $d_{\max}$, the minimum number of samples per leaf $n_{\text{leaf}}$, the $\ell_2$ leaf-weight penalty $\lambda$, and the feature-sampling fraction $\rho_f$.
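To make the "second-order approximation" concrete, the sketch below evaluates the logistic loss of Eq. (3) together with its first and second derivatives with respect to the raw score $F(x)$, which are the per-sample quantities a second-order boosting step consumes. The function names are illustrative and are not part of LightGBM's API.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logistic_loss(y: int, raw_score: float) -> float:
    """Eq. (3): negative log-likelihood of label y given the raw score F(x)."""
    eps = 1e-15  # guard against log(0)
    p = min(max(sigmoid(raw_score), eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def grad_hess(y: int, raw_score: float) -> tuple[float, float]:
    """Gradient (p - y) and Hessian p(1 - p) of Eq. (3) w.r.t. F(x)."""
    p = sigmoid(raw_score)
    return p - y, p * (1.0 - p)
```

At a raw score of zero the predicted probability is 0.5, so the loss equals $\log 2$ for either label and the Hessian takes its maximum value of 0.25.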
### 3.3 Conformal Prediction Framework
We adopt the split conformal prediction protocol \[[35](https://arxiv.org/html/2606.09860#bib.bib19),[43](https://arxiv.org/html/2606.09860#bib.bib17)\]. After drawing a stratified random partition of the labeled data into a proper training set $\mathcal{D}_{\text{train}}$ and a calibration set $\mathcal{D}_{\text{cal}}$ with $|\mathcal{D}_{\text{cal}}| = m$, we train the LightGBM model on $\mathcal{D}_{\text{train}}$ and define a nonconformity score for each calibration point $(x_i, y_i) \in \mathcal{D}_{\text{cal}}$:

$$s_i = 1 - \hat{p}_{y_i}(x_i), \tag{4}$$

where $\hat{p}_{y_i}(x_i)$ denotes the model's predicted probability for the true class $y_i$. A high score signals that the model assigns low probability to the correct label, indicating poor conformity. Let $s_{(1)} \leq s_{(2)} \leq \cdots \leq s_{(m)}$ be the sorted calibration scores and define the conformal quantile

$$\hat{q}_{\alpha} = s_{\left(\lceil (1-\alpha)(m+1) \rceil\right)}. \tag{5}$$

The prediction set for a new test point $x_{n+1}$ is then

$$\mathcal{C}_{\alpha}(x_{n+1}) = \bigl\{\, y \in \{0,1\} : 1 - \hat{p}_{y}(x_{n+1}) \leq \hat{q}_{\alpha} \,\bigr\}. \tag{6}$$
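The three steps in Eqs. (4)–(6) can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the base model exposes only the positive-class probability `p1` (so that the probability of class 0 is `1 - p1`); the function names are hypothetical.

```python
import math


def nonconformity(p1: float, y: int) -> float:
    """Eq. (4): one minus the predicted probability of class y."""
    return 1.0 - (p1 if y == 1 else 1.0 - p1)


def conformal_quantile(cal_scores, alpha: float) -> float:
    """Eq. (5): the ceil((1 - alpha)(m + 1))-th smallest calibration score."""
    m = len(cal_scores)
    k = math.ceil((1.0 - alpha) * (m + 1))
    if k > m:
        # Too few calibration points for this alpha: every label is admitted.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(p1: float, q_hat: float) -> set:
    """Eq. (6): all labels whose nonconformity score is at most q_hat."""
    return {y for y in (0, 1) if nonconformity(p1, y) <= q_hat}
```

With seven calibration scores and $\alpha = 0.25$, the quantile index is $\lceil 0.75 \times 8 \rceil = 6$, so `q_hat` is the sixth smallest score. A test point with `p1 = 0.9` then yields the singleton `{1}` whenever `q_hat` lies between 0.1 and 0.9.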
###### Theorem 1 (Marginal coverage guarantee).
Suppose $(X_1, Y_1), \ldots, (X_m, Y_m), (X_{n+1}, Y_{n+1})$ are exchangeable random variables. Then the prediction set $\mathcal{C}_{\alpha}$ defined in ([6](https://arxiv.org/html/2606.09860#S3.E6)) satisfies

$$\Pr\bigl(Y_{n+1} \in \mathcal{C}_{\alpha}(X_{n+1})\bigr) \;\geq\; 1 - \alpha. \tag{7}$$

If, in addition, the nonconformity scores have a continuous joint distribution, then

$$\Pr\bigl(Y_{n+1} \in \mathcal{C}_{\alpha}(X_{n+1})\bigr) \;\leq\; 1 - \alpha + \frac{1}{m+1}. \tag{8}$$
###### Proof\.
By exchangeability, the rank of $s_{n+1}$ among $\{s_1, \ldots, s_m, s_{n+1}\}$ is uniformly distributed on $\{1, \ldots, m+1\}$ when ties are broken uniformly at random. The event $Y_{n+1} \in \mathcal{C}_{\alpha}(X_{n+1})$ is equivalent to $s_{n+1} \leq \hat{q}_{\alpha}$. Since $\hat{q}_{\alpha}$ is the $\lceil (1-\alpha)(m+1) \rceil$-th smallest calibration score, this event occurs whenever the rank of $s_{n+1}$ is at most $\lceil (1-\alpha)(m+1) \rceil$, so

$$\Pr(s_{n+1} \leq \hat{q}_{\alpha}) \;\geq\; \frac{\lceil (1-\alpha)(m+1) \rceil}{m+1} \;\geq\; 1 - \alpha.$$

When the scores are continuously distributed, ties occur with probability zero and the first inequality holds with equality, giving

$$\Pr(s_{n+1} \leq \hat{q}_{\alpha}) = \frac{\lceil (1-\alpha)(m+1) \rceil}{m+1} \leq \frac{(1-\alpha)(m+1) + 1}{m+1} = 1 - \alpha + \frac{1}{m+1}. \qquad \blacksquare$$
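The coverage expression in the proof can be checked numerically. The sketch below computes the exact continuous-case coverage $\lceil (1-\alpha)(m+1) \rceil / (m+1)$ and compares it with a Monte Carlo estimate. The simulation draws i.i.d. uniform scores, which is one convenient exchangeable example chosen for illustration and not a model of the clinical data.

```python
import math
import random


def exact_coverage(m: int, alpha: float) -> float:
    """Continuous-case coverage ceil((1 - alpha)(m + 1)) / (m + 1)."""
    return math.ceil((1.0 - alpha) * (m + 1)) / (m + 1)


def simulated_coverage(m: int, alpha: float, trials: int = 20000, seed: int = 0) -> float:
    """Fraction of trials in which a fresh score falls at or below q_hat."""
    k = math.ceil((1.0 - alpha) * (m + 1))
    if k > m:
        raise ValueError("need ceil((1 - alpha)(m + 1)) <= m")
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        cal = sorted(rng.random() for _ in range(m))  # m calibration scores
        q_hat = cal[k - 1]                            # Eq. (5)
        hits += rng.random() <= q_hat                 # test score covered?
    return hits / trials
```

For a calibration set of size $m = 437$ and $\alpha = 0.10$, `exact_coverage` gives $395/438 \approx 0.9018$, which sits between the lower bound 0.90 of Eq. (7) and the upper bound $0.90 + 1/438 \approx 0.9023$ of Eq. (8).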
###### Corollary 1 (Prediction-set size interpretation).
Under the conditions of Theorem [1](https://arxiv.org/html/2606.09860#Thmtheorem1), the prediction set $\mathcal{C}_{\alpha}(x)$ for binary classification satisfies $|\mathcal{C}_{\alpha}(x)| \in \{0, 1, 2\}$. A singleton set $\mathcal{C}_{\alpha}(x) = \{y\}$ indicates that the model is confident in class $y$ at level $1 - \alpha$. A set of size two signals ambiguity, while an empty set flags an outlier. Because an empty set necessarily fails to cover the true label, it occurs with probability at most $\alpha$.
For clinical risk stratification, we define a conformalized risk score as
$$r(x) = \hat{p}_1(x) + \gamma \cdot \mathbb{1}\bigl[\, |\mathcal{C}_{\alpha}(x)| = 2 \,\bigr], \tag{9}$$

where $\gamma > 0$ is a penalty term that elevates the effective risk for patients whose prediction sets are ambiguous, reflecting the clinical desirability of a conservative stance under uncertainty.
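Eq. (9) can be computed directly from the positive-class probability and the conformal quantile. In the sketch below, the default `gamma = 0.1` is a placeholder chosen for illustration, since the value used by the authors is not stated here.

```python
def conformal_risk_score(p1: float, q_hat: float, gamma: float = 0.1) -> float:
    """Eq. (9): p1 plus a penalty gamma when both labels are in the set."""
    label_1_in_set = (1.0 - p1) <= q_hat  # nonconformity of y = 1
    label_0_in_set = p1 <= q_hat          # nonconformity of y = 0 is 1 - p0 = p1
    ambiguous = label_0_in_set and label_1_in_set
    return p1 + (gamma if ambiguous else 0.0)
```

Note that the score is not a probability: for an ambiguous patient with `p1` close to one, `p1 + gamma` can exceed 1, so any tier thresholds should be defined on the scale of $r(x)$.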
### 3.4 Stability-Based Feature Selection
We employ a mutual-information stability selection procedure to identify a stable and informative feature subset. For a candidate feature $X_j$ and the target $Y$, the mutual information $I(X_j; Y)$ quantifies the reduction in uncertainty about $Y$ obtained by observing $X_j$:

$$I(X_j; Y) = \sum_{y \in \{0,1\}} \int p(x_j, y) \log \frac{p(x_j, y)}{p(x_j)\, p(y)} \, dx_j. \tag{10}$$

We estimate $I(X_j; Y)$ using the Kraskov–Stögbauer–Grassberger (KSG) $k$-nearest-neighbor estimator \[[20](https://arxiv.org/html/2606.09860#bib.bib27)\], which is consistent and has favorable bias properties for mixed continuous-discrete distributions.

The stability selection procedure operates as follows. We draw $B$ bootstrap resamples of size $\lfloor n/2 \rfloor$ from the training set, where $n$ here denotes the training-set size and each resample is drawn without replacement. On each resample $b$, we compute $\hat{I}^{(b)}(X_j; Y)$ for every feature $j$ and retain the top-$q$ features, where $q$ is a budget hyperparameter. The selection probability for feature $j$ is

$$\hat{\Pi}_j = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}\bigl[\, j \in \hat{S}^{(b)}_{q} \,\bigr], \tag{11}$$

and the final selected set is $\hat{S} = \{ j : \hat{\Pi}_j \geq \pi_{\text{thr}} \}$. Meinshausen and Bühlmann \[[34](https://arxiv.org/html/2606.09860#bib.bib25)\] showed that the expected number of falsely selected features satisfies
$$\mathbb{E}[V] \leq \frac{q^{2}}{(2\pi_{\text{thr}} - 1)\, d}, \tag{12}$$

where $V$ is the number of noise features in $\hat{S}$. For our setting with $d = 78$, $q = 20$, and $\pi_{\text{thr}} = 0.75$, this yields $\mathbb{E}[V] \leq 20^{2} / (0.5 \times 78) \approx 10.26$. This worst-case bound is deliberately conservative: it assumes that all $d$ features except the true ones are pure noise, which is unrealistic for correlated metabolic panels. In our experiments, only 14 features exceed $\pi_{\text{thr}}$ and all belong to clinically established risk factor categories (Table [2](https://arxiv.org/html/2606.09860#S4.T2)), suggesting that the actual number of false inclusions is close to zero. Raising $\pi_{\text{thr}}$ to 0.85 would tighten the bound to $\mathbb{E}[V] \leq 20^{2} / (0.7 \times 78) \approx 7.33$ at the cost of potentially dropping borderline-informative variables.
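The subsampling loop of Eq. (11) and the bound of Eq. (12) can be sketched as follows. The relevance score used here, the absolute difference in class means, is a deliberately simple stand-in so that the example runs with the standard library only; it is not the KSG mutual-information estimator described above. Any callable with the same signature, including a wrapper around a mutual-information estimator, could be passed as `score_fn`.

```python
import random


def pfer_bound(q: int, d: int, pi_thr: float) -> float:
    """Eq. (12): upper bound on the expected number of false selections."""
    if not 0.5 < pi_thr <= 1.0:
        raise ValueError("the bound requires 0.5 < pi_thr <= 1")
    return q * q / ((2.0 * pi_thr - 1.0) * d)


def abs_mean_gap(column, labels) -> float:
    """Stand-in relevance score (NOT KSG): |mean(x | y=1) - mean(x | y=0)|."""
    pos = [x for x, t in zip(column, labels) if t == 1]
    neg = [x for x, t in zip(column, labels) if t == 0]
    if not pos or not neg:
        return 0.0
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def stability_selection(X, y, score_fn, q, pi_thr, B=200, seed=0):
    """Selection frequencies over B half-size subsamples, as in Eq. (11)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    counts = [0] * d
    for _ in range(B):
        idx = rng.sample(range(n), n // 2)  # half-size draw without replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        scores = [score_fn([row[j] for row in Xb], yb) for j in range(d)]
        top_q = sorted(range(d), key=lambda j: scores[j], reverse=True)[:q]
        for j in top_q:
            counts[j] += 1
    freqs = [c / B for c in counts]
    selected = [j for j in range(d) if freqs[j] >= pi_thr]
    return selected, freqs
```

Calling `pfer_bound(20, 78, 0.75)` evaluates Eq. (12) for the setting $d = 78$, $q = 20$, $\pi_{\text{thr}} = 0.75$ and returns approximately 10.26.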
### 3.5 Algorithmic Pipeline
Algorithm [1](https://arxiv.org/html/2606.09860#alg1) summarizes the complete LiverRisk pipeline. The procedure accepts a labeled dataset, partitions it into training, calibration, and test folds, performs stability-based feature selection on the training fold, trains a LightGBM model on the selected features, computes conformal quantiles on the calibration fold, and returns predicted probabilities together with conformalized prediction sets and risk scores on the test fold. Figure [2](https://arxiv.org/html/2606.09860#S3.F2) provides a graphical overview of the pipeline.
Algorithm 1: The LiverRisk Pipeline

Input: Labeled data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, miscoverage level $\alpha$, bootstrap count $B$, feature budget $q$, stability threshold $\pi_{\text{thr}}$

Output: Risk scores $\{r_i\}$, prediction sets $\{\mathcal{C}_{\alpha}(x_i)\}$ for test data

1. Split $\mathcal{D}$ into $\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{cal}}$, $\mathcal{D}_{\text{test}}$ (60%/20%/20%)
2. // Stability-based feature selection
3. for $b = 1$ to $B$ do
4. Draw subsample $\mathcal{D}^{(b)} \subset \mathcal{D}_{\text{train}}$ of size $\lfloor |\mathcal{D}_{\text{train}}| / 2 \rfloor$
5. Estimate $\hat{I}^{(b)}(X_j; Y)$ for all $j$ via KSG estimator
6. $\hat{S}^{(b)}_{q} \leftarrow$ top-$q$ features by $\hat{I}^{(b)}$
7. end for
8. $\hat{\Pi}_j \leftarrow B^{-1} \sum_{b} \mathbb{1}[j \in \hat{S}^{(b)}_{q}]$ for all $j$
9. $\hat{S} \leftarrow \{ j : \hat{\Pi}_j \geq \pi_{\text{thr}} \}$
10. // Model training
11. Train LightGBM on $\mathcal{D}_{\text{train}}$ restricted to features $\hat{S}$
12. // Conformal calibration
13. for each $(x_i, y_i) \in \mathcal{D}_{\text{cal}}$ do
14. $s_i \leftarrow 1 - \hat{p}_{y_i}(x_i)$
15. end for
16. $\hat{q}_{\alpha} \leftarrow s_{(\lceil (1-\alpha)(m+1) \rceil)}$
17. // Prediction and risk scoring
18. for each $x_j \in \mathcal{D}_{\text{test}}$ do
19. $\mathcal{C}_{\alpha}(x_j) \leftarrow \{ y : 1 - \hat{p}_{y}(x_j) \leq \hat{q}_{\alpha} \}$
20. $r(x_j) \leftarrow \hat{p}_1(x_j) + \gamma \cdot \mathbb{1}[|\mathcal{C}_{\alpha}(x_j)| = 2]$
21. end for
22. return $\{r(x_j)\}$, $\{\mathcal{C}_{\alpha}(x_j)\}$
Figure 2: Overview of the LiverRisk pipeline. Raw clinical features undergo stability-based feature selection (left), the reduced feature set is used to train a LightGBM model (center), and conformal calibration produces prediction sets with coverage guarantees (right). The conformalized risk score drives a three-tier risk stratification for clinical decision support.
## 4 Experiments
### 4.1 Dataset and Experimental Setup
We retrospectively collected health examination records from three centers affiliated with the Guangzhou municipal health screening program between January 2021 and December 2024. The primary cohort comprises 2,187 adults ($\geq$18 years) from two centers (Center A and Center B), while the external validation cohort consists of 412 adults from a third geographically distinct center (Center C) that uses different laboratory instrumentation. NAFLD was diagnosed by abdominal ultrasonography performed by experienced sonographers, following the 2018 AASLD practice guidance \[[8](https://arxiv.org/html/2606.09860#bib.bib2)\]. Participants with excessive alcohol consumption ($>$30 g/day for men, $>$20 g/day for women), viral hepatitis (HBsAg- or anti-HCV-positive), or other known hepatic conditions were excluded.
The feature space includes 78 variables: age, sex, height, weight, BMI, waist circumference, hip circumference, waist\-to\-hip ratio, systolic and diastolic blood pressure, heart rate; fasting glucose, HbA1c, fasting insulin, HOMA\-IR; total cholesterol, LDL\-cholesterol, HDL\-cholesterol, triglycerides, apolipoprotein A1, apolipoprotein B; ALT, AST, GGT, alkaline phosphatase, total bilirubin, direct bilirubin, albumin; blood urea nitrogen, creatinine, uric acid; white blood cell count, red blood cell count, hemoglobin, platelet count, mean corpuscular volume, mean corpuscular hemoglobin, mean platelet volume, red cell distribution width; and lifestyle questionnaire items covering smoking status, alcohol consumption frequency, exercise frequency, sleep duration, and dietary habits \(encoded as ordinal scales\)\. Missing values occurred in 4\.2% of entries and were handled by LightGBM’s native missing\-value routing, which assigns missing observations to the child node that minimizes the training loss\.
The primary cohort was split into training (60%, $n = 1{,}312$), calibration (20%, $n = 437$), and test (20%, $n = 438$) sets using stratified random sampling to preserve the NAFLD prevalence of 38.6%. The external cohort ($n = 412$, prevalence 36.2%) was used exclusively for out-of-distribution evaluation. Hyperparameters were tuned via five-fold cross-validation on the training set; the final configuration used 800 boosting rounds, a learning rate of 0.03, maximum depth 7, a minimum of 50 samples per leaf, $\ell_2$ regularization $\lambda = 1.5$, and feature fraction 0.8. Stability selection used $B = 200$ bootstrap resamples, $q = 20$, and $\pi_{\text{thr}} = 0.75$. The conformal miscoverage level was set at $\alpha = 0.10$.
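For readers who want to reproduce the setup, the reported settings can be collected into a configuration such as the one below. The dictionary keys follow the parameter names of LightGBM's scikit-learn wrapper (`lightgbm.LGBMClassifier`); the mapping from the prose to these names is an assumption, and parameters the text does not report (for example `num_leaves`) are left at the library defaults. The `split_sizes` helper is hypothetical and assumes the fold sizes are obtained by rounding down, with the test fold taking the remainder.

```python
# Final LightGBM configuration as reported in the experimental setup.
LIGHTGBM_PARAMS = {
    "objective": "binary",
    "n_estimators": 800,        # boosting rounds
    "learning_rate": 0.03,
    "max_depth": 7,
    "min_child_samples": 50,    # minimum samples per leaf
    "reg_lambda": 1.5,          # l2 leaf-weight penalty
    "colsample_bytree": 0.8,    # feature fraction
}

# Stability-selection and conformal settings (key names are illustrative).
SELECTION_AND_CONFORMAL = {
    "n_resamples": 200,          # B
    "feature_budget": 20,        # q
    "stability_threshold": 0.75, # pi_thr
    "alpha": 0.10,               # miscoverage level
}


def split_sizes(n: int, train_frac: float = 0.6, cal_frac: float = 0.2):
    """Train/calibration/test counts; the test fold takes the remainder."""
    n_train = int(n * train_frac)
    n_cal = int(n * cal_frac)
    return n_train, n_cal, n - n_train - n_cal
```

Under the rounding assumption above, `split_sizes(2187)` reproduces the reported fold sizes of 1,312, 437, and 438. With LightGBM installed, the first dictionary could be passed as `LGBMClassifier(**LIGHTGBM_PARAMS)`.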
### 4.2 Main Results
Table [1](https://arxiv.org/html/2606.09860#S4.T1) reports discrimination and calibration metrics for LiverRisk and six baseline methods on the internal test set and external validation cohort. LiverRisk achieves the highest AUROC on both cohorts (0.912 internal, 0.891 external), followed by XGBoost (0.904, 0.882) and the deep neural network (DNN; 0.893, 0.864). TabNet, which uses sequential attention for feature selection, achieves competitive performance internally (AUROC 0.889) but degrades more substantially on the external cohort (0.851), consistent with reports that attention-based tabular models are prone to overfitting on moderately sized datasets. Support vector machines with an RBF kernel (SVM-RBF) attain an AUROC of 0.873 internally but drop to 0.842 externally, while logistic regression (LR) and $k$-nearest neighbors ($k$-NN) trail at 0.845/0.831 and 0.812/0.793, respectively.
Table 1: Comparison of LiverRisk against baseline methods on the internal test set ($n = 438$) and external validation cohort ($n = 412$). Best results are in bold; second-best are underlined. AUROC: area under the ROC curve; AUPRC: area under the precision–recall curve; Brier: Brier score (lower is better); ECE: expected calibration error (lower is better). 95% confidence intervals are computed via 1,000 bootstrap resamples.

Calibration metrics reinforce the advantage of LiverRisk. The Brier score of 0.148 (internal) and 0.162 (external) improves over the next-best model (XGBoost at 0.153 and SVM-RBF at 0.169, respectively) by 3.3% and 4.1%, while the expected calibration error (ECE) of 0.023 and 0.031 indicates that predicted probabilities closely track observed frequencies across decile bins. The DNN and TabNet yield considerably higher ECE values (0.041–0.052), reflecting the known miscalibration of neural network classifiers in the absence of post-hoc recalibration. The SVM-RBF records a comparatively low internal ECE (0.027), aided by Platt scaling \[[36](https://arxiv.org/html/2606.09860#bib.bib34)\] applied during cross-validation; the small increase to 0.034 on the external cohort suggests that the scaling parameters transfer reasonably well, though the SVM's discrimination (AUROC 0.842) lags behind tree-based methods.
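The two calibration metrics and the relative improvements quoted above can be computed as sketched below. The ECE here uses ten equal-width probability bins, which is one common reading of "decile bins"; the exact binning scheme used by the authors is an assumption.

```python
def brier_score(probs, labels) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted mean of |observed rate - mean prediction|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[b].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        obs_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(obs_rate - mean_pred)
    return ece


def relative_improvement(baseline: float, model: float) -> float:
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    return (baseline - model) / baseline
```

As a check on the quoted figures, `relative_improvement(0.153, 0.148)` is about 0.033 and `relative_improvement(0.169, 0.162)` is about 0.041, matching the 3.3% and 4.1% Brier-score improvements.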
Figure [3](https://arxiv.org/html/2606.09860#S4.F3) displays the AUROC comparison as a grouped bar chart across all methods and both evaluation cohorts, visually confirming the consistent advantage of LiverRisk.
Figure 3: AUROC on the (a) internal test set and (b) external validation cohort for all seven methods in Table [1](https://arxiv.org/html/2606.09860#S4.T1). LiverRisk achieves the highest score on both cohorts.
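For readers who want to reproduce this kind of comparison, the following sketch computes AUROC from its pairwise-ranking definition and a percentile bootstrap interval like the 1,000-resample intervals mentioned in the Table 1 caption. It is an illustration under assumptions: the function names are hypothetical, the percentile method is one common choice (the paper does not state which bootstrap interval it uses), and the quadratic-time AUROC is written for clarity rather than speed.

```python
import random


def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, scores, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for AUROC (assumed interval type)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one of the classes
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi
```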
### 4.3 Analysis and Ablation
#### 4.3.1 Feature importance.
Table [2](https://arxiv.org/html/2606.09860#S4.T2) lists the top 15 features by stability selection probability $\hat{\Pi}_j$ alongside the LightGBM split-based importance (normalized to sum to one over all selected features). Waist circumference achieves the highest selection probability (0.97) and the largest share of splits (0.141), aligning with clinical evidence that visceral adiposity is a primary driver of hepatic fat accumulation \[[45](https://arxiv.org/html/2606.09860#bib.bib32)\]. ALT and GGT follow with selection probabilities of 0.95 and 0.93, consistent with their role as liver injury surrogates. Triglycerides (0.91) and fasting glucose (0.89) reflect the metabolic syndrome nexus, while BMI (0.88) and HDL-cholesterol (0.86) capture complementary adiposity and lipid dimensions. The two rankings do not coincide exactly: uric acid, for instance, ranks 8th by selection probability (0.84) but 11th by split importance (0.044), which may indicate that it is used in relatively few splits despite being selected consistently across resamples. Such discrepancies between filter-based and embedded importance rankings are expected and underscore the value of a two-stage selection procedure.
Table 2: Top 15 features ranked by stability selection probability ($\hat{\Pi}$). Split importance is normalized over the selected feature set. Features above the dashed line compose the final selected set ($\hat{\Pi} \geq 0.75$).
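The selection probability $\hat{\Pi}_j$ can be read as the fraction of resamples in which feature $j$ ranks among the most informative features. The sketch below illustrates that idea in pure Python under several assumptions that are not taken from the paper: it swaps in a simple quantile-binned plug-in mutual-information estimate in place of a nearest-neighbor estimator, it uses bootstrap resampling with a fixed `top_k` per resample, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def binned_mi(x, y, n_bins=4):
    """Plug-in mutual information (nats) between a rank-binned feature and a label."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // n  # equal-mass bins; ties are split arbitrarily
    joint, px, py = Counter(zip(bins, y)), Counter(bins), Counter(y)
    return sum(c / n * math.log(c * n / (px[b] * py[l])) for (b, l), c in joint.items())


def selection_probabilities(X, y, top_k, n_boot=100, seed=0):
    """X maps feature name -> list of values. Returns name -> selection frequency."""
    rng = random.Random(seed)
    n = len(y)
    counts = Counter()
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        scores = {name: binned_mi([col[i] for i in idx], yb) for name, col in X.items()}
        # Count each feature that lands in the top_k by mutual information.
        for name in sorted(scores, key=scores.get, reverse=True)[:top_k]:
            counts[name] += 1
    return {name: counts[name] / n_boot for name in X}


def select(probs, threshold=0.75):
    """Keep features whose selection probability reaches the threshold."""
    return [name for name, p in probs.items() if p >= threshold]
```

The default `threshold=0.75` mirrors the cutoff stated in the Table 2 caption; everything else is illustrative.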
#### 4.3.2 Feature group ablation.
To assess the contribution of each clinical domain, we retrain LiverRisk after removing one feature group at a time (Table [3](https://arxiv.org/html/2606.09860#S4.T3)). Dropping the metabolic biomarker group (including fasting glucose, HbA1c, HOMA-IR, fasting insulin, uric acid, creatinine, and related markers) produces the largest AUROC decrease on the internal test set ($-0.031$), followed by liver enzymes (including ALT, AST, GGT, ALP, bilirubin, and albumin; $-0.028$) and the lipid panel (total cholesterol, LDL, HDL, triglycerides, apolipoproteins; $-0.024$). Removing anthropometrics causes a reduction of 0.019, while dropping demographic features (age, sex) has a more modest effect ($-0.009$). Hematological indices contribute the least individually ($-0.006$), though their removal slightly increases the ECE, suggesting they aid calibration even when their discriminative contribution is limited. Figure [4](https://arxiv.org/html/2606.09860#S4.F4) presents these results as a bar chart.
Table 3: Feature group ablation study. Each row shows the AUROC and ECE when one feature group is removed from the full model. $\Delta$AUROC is the change relative to the full model (0.912).

Figure 4: Feature group ablation: AUROC decrease when each feature group is removed from the full LiverRisk model. Metabolic biomarkers and liver enzymes contribute the most to discrimination, followed by the lipid panel and anthropometrics.
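The leave-one-group-out protocol described above amounts to a short loop around a train-and-evaluate routine. The sketch below shows the bookkeeping only; `fit_and_score` is a hypothetical callable standing in for whatever retrains the model on a feature list and returns its test AUROC, and is not part of the paper.

```python
def group_ablation(feature_groups, fit_and_score):
    """Retrain with each feature group removed and report the AUROC change.

    feature_groups: dict mapping group name -> list of feature names.
    fit_and_score:  hypothetical callable(list of features) -> AUROC.
    Returns (full-model AUROC, {group: delta AUROC}) sorted by most negative delta.
    """
    all_features = [f for feats in feature_groups.values() for f in feats]
    full = fit_and_score(all_features)
    deltas = {}
    for group, feats in feature_groups.items():
        dropped = set(feats)
        kept = [f for f in all_features if f not in dropped]
        deltas[group] = fit_and_score(kept) - full
    return full, dict(sorted(deltas.items(), key=lambda kv: kv[1]))
```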
#### 4.3.3 Calibration analysis.
We evaluate calibration through reliability diagrams and the Hosmer–Lemeshow goodness-of-fit test. After conformal calibration, the predicted-vs-observed frequency plot for LiverRisk lies close to the diagonal across all decile bins, with a Hosmer–Lemeshow $p$-value of 0.42 (internal) and 0.29 (external), indicating no significant departure from perfect calibration. By contrast, the uncalibrated DNN yields $p<0.001$ on both cohorts, and TabNet achieves $p=0.03$ internally. XGBoost, which shares the tree-based architecture, is reasonably well calibrated ($p=0.18$ internal) but degrades on the external cohort ($p=0.04$).
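As a reference point for the test used above, the sketch below computes the Hosmer–Lemeshow statistic over equal-size risk groups and its $p$-value from a chi-square distribution with (groups − 2) degrees of freedom. It is a minimal standard-library illustration with hypothetical function names; the closed-form survival function applies only to even degrees of freedom, which covers the usual ten-group case (8 degrees of freedom).

```python
import math


def chi2_sf_even_df(x, df):
    """Chi-square survival function for positive even df (Erlang/Poisson identity)."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form covers positive even df only")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


def hosmer_lemeshow(y_true, p_pred, n_groups=10):
    """Return (statistic, p-value) using groups of equal size sorted by predicted risk."""
    pairs = sorted(zip(p_pred, y_true))
    n = len(pairs)
    stat = 0.0
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue
        expected = sum(p for p, _ in chunk)   # expected number of events
        observed = sum(y for _, y in chunk)   # observed number of events
        denom = expected * (1.0 - expected / len(chunk))
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat, chi2_sf_even_df(stat, n_groups - 2)
```

A large $p$-value, as reported for LiverRisk, means the test finds no evidence of miscalibration; it does not by itself prove perfect calibration.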
### 4.4 Generalization and Robustness
#### 4.4.1 External validation.
The right half of Table [1](https://arxiv.org/html/2606.09860#S4.T1) reports external validation performance. Among the four strongest models, LiverRisk exhibits the smallest AUROC drop from internal to external evaluation ($-0.021$), compared to $-0.022$ for XGBoost, $-0.029$ for DNN, and $-0.038$ for TabNet. Logistic regression and $k$-NN lose less in absolute terms ($-0.014$ and $-0.019$, from the AUROC values reported in Section 4.2) but start from a much lower internal AUROC. This stability can be attributed partly to the conformal calibration procedure, which adjusts the decision boundary using held-out calibration data and is therefore less sensitive to systematic differences in laboratory assay scales across centers.
#### 4.4.2 Subgroup analysis.
Table [4](https://arxiv.org/html/2606.09860#S4.T4) reports AUROC and conformal coverage for clinically relevant subgroups on the combined internal-test and external-validation data. Performance is generally stable across age and sex strata. The AUROC is slightly lower for participants aged $\geq 60$ (0.883 vs. 0.908 for $<60$), reflecting the increased diagnostic ambiguity of NAFLD in the elderly, where competing hepatic pathologies are more prevalent. Coverage remains at or above the nominal 90% level in all subgroups examined. The conformal guarantee is only marginal, so this subgroup-level coverage is an empirical observation on this cohort rather than a consequence of the theory; it holds here even though subgroup-specific calibration is not enforced.
Table 4: Subgroup analysis on combined internal-test and external-validation data. Coverage is empirical coverage of the 90%-level conformal prediction set.
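Subgroup coverage of the kind reported in Table 4 is simply the fraction of patients in each stratum whose true label falls inside their prediction set. A minimal sketch, with a hypothetical function name and toy subgroup labels:

```python
from collections import defaultdict


def coverage_by_group(y_true, pred_sets, groups):
    """Empirical coverage per subgroup: share of cases whose true label is in the set."""
    hits, totals = defaultdict(int), defaultdict(int)
    for y, pred_set, g in zip(y_true, pred_sets, groups):
        totals[g] += 1
        hits[g] += y in pred_set  # True counts as 1
    return {g: hits[g] / totals[g] for g in totals}
```

For example, `coverage_by_group([1, 0, 1, 0], [{1}, {0, 1}, {0}, {0}], ["<60", "<60", ">=60", ">=60"])` gives full coverage in the first stratum and 0.5 in the second.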
#### 4.4.3 Cross-validation stability.
To assess sensitivity to the particular train–calibration–test split, we repeat the entire LiverRisk pipeline (including feature selection and conformal calibration) across 10 random stratified partitions of the primary cohort. The mean AUROC is 0.910 (standard deviation 0.008), the mean coverage is 0.912 (s.d. 0.006), and the mean number of selected features is 13.7 (s.d. 1.2). These small standard deviations indicate that neither the feature selection step nor the conformal threshold is unduly sensitive to the data partition.
#### 4.4.4 Coverage bound verification.
Theorem [1](https://arxiv.org/html/2606.09860#Thmtheorem1) guarantees marginal coverage $\geq 1-\alpha = 0.90$. To verify this, we repeat the calibration–evaluation cycle 1,000 times, each time drawing a fresh calibration set of size $m=437$ from the pooled training data and measuring coverage on the held-out test set. The empirical coverage ranges from 0.901 to 0.938, with a mean of 0.913 and a median of 0.912. None of the 1,000 trials fell below the nominal 0.90 level, consistent with the theoretical lower bound. The mean coverage of 0.913 exceeds the continuity-based upper bound of $1-\alpha+1/(m+1) \approx 0.902$ from Theorem [1](https://arxiv.org/html/2606.09860#Thmtheorem1); this is expected because LightGBM outputs discrete probability scores (due to finite tree structure), so ties among nonconformity scores inflate the effective quantile and push coverage above the continuous-score limit. The lower bound $1-\alpha$, which requires only exchangeability and not continuity, holds across all 1,000 trials.
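To make the calibration step concrete, here is a minimal split-conformal sketch for binary classification. It assumes the common nonconformity score $1-\hat{p}(y \mid x)$ and the finite-sample quantile index $\lceil (m+1)(1-\alpha) \rceil$; the paper's exact score is defined in its methods section and may differ, and the function names are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha=0.10):
    """Finite-sample quantile: the ceil((m+1)(1-alpha))-th smallest calibration score."""
    m = len(cal_scores)
    k = math.ceil((m + 1) * (1 - alpha))
    if k > m:
        return float("inf")  # too few calibration points: every label is kept
    return sorted(cal_scores)[k - 1]


def prediction_set(p_pos, q):
    """Labels whose assumed nonconformity score 1 - p(label) does not exceed q."""
    return {label for label, p in ((1, p_pos), (0, 1 - p_pos)) if 1 - p <= q}


def coverage_bounds(m, alpha=0.10):
    """Marginal coverage bounds: lower needs exchangeability, upper also needs no ties."""
    return 1 - alpha, 1 - alpha + 1 / (m + 1)
```

With $m=437$ and $\alpha=0.10$, `coverage_bounds(437)` returns an upper bound of about 0.902, the value quoted in the text. A prediction set of size two, such as `prediction_set(0.5, 0.6)`, flags a case the model cannot resolve at the chosen level.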
#### 4.4.5 Feature cardinality sensitivity.
Figure [5](https://arxiv.org/html/2606.09860#S4.F5) plots the internal AUROC as a function of the number of selected features, obtained by varying $\pi_{\text{thr}}$ from 0.50 to 0.95. Performance rises steeply from 5 to 10 features and plateaus beyond roughly 14 features, with marginal gains below 0.003 AUROC for each additional variable beyond this point. The operating point at $\pi_{\text{thr}}=0.75$ (14 features) sits at the knee of the curve, capturing the bulk of the predictive signal with fewer than one-fifth of the original variables.
Figure 5: AUROC as a function of the number of features selected by stability selection (varying $\pi_{\text{thr}}$). The shaded band shows $\pm 1$ s.d. over 10 cross-validation folds. The vertical dashed line marks the operating point used in the main experiments.
#### 4.4.6 Hyperparameter sensitivity.
Figure [6](https://arxiv.org/html/2606.09860#S4.F6) presents a heatmap of AUROC as a function of two key hyperparameters: the learning rate $\eta$ and the maximum tree depth $d_{\max}$. Performance is stable across $\eta \in [0.01, 0.05]$ and $d_{\max} \in [5, 9]$, with the optimum at $\eta=0.03$, $d_{\max}=7$. Shallow trees ($d_{\max}=3$) underfit the metabolic interaction structure, while very deep trees ($d_{\max} \geq 11$) overfit, particularly on the external cohort.
Figure 6: Internal-test AUROC for different combinations of learning rate $\eta$ (rows) and maximum tree depth $d_{\max}$ (columns). The white cross marks the selected configuration ($\eta=0.03$, $d_{\max}=7$). AUROC is highest in the central region and degrades at both extremes.
### 4.5 Clinical Utility
#### 4.5.1 Risk stratification.
Using the conformalized risk score $r(x)$ defined in Eq. ([9](https://arxiv.org/html/2606.09860#S3.E9)) with $\gamma=0.05$, we partition patients into three tiers: low risk ($r<0.25$), moderate risk ($0.25 \leq r < 0.60$), and high risk ($r \geq 0.60$). Table [5](https://arxiv.org/html/2606.09860#S4.T5) summarizes the characteristics and outcomes of each tier on the external validation cohort. The low-risk tier ($n=186$, 45.1% of the cohort) has an observed NAFLD prevalence of 5.4%, while the high-risk tier ($n=112$, 27.2%) has a prevalence of 76.8%. Among the 78 high-risk patients who had follow-up data at 12 months, 33.3% showed evidence of disease progression (defined as worsening ultrasonographic grade or incident NASH), compared to 7.1% in the low-risk tier, a 4.7-fold difference that supports the clinical utility of the stratification.
Table 5: Risk stratification on the external validation cohort ($n=412$). Progression rate is the fraction with worsening NAFLD grade at 12-month follow-up among those with available data.
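The tier assignment and the headline ratios can be checked with a few lines of arithmetic. The sketch below takes a risk score as given (the definition of $r(x)$ in Eq. (9) is not reproduced here) and applies the cutoffs stated above; the function name is hypothetical.

```python
def risk_tier(r, low_cut=0.25, high_cut=0.60):
    """Map a conformalized risk score to a tier using the cutoffs stated in the text."""
    if r < low_cut:
        return "low"
    if r < high_cut:
        return "moderate"
    return "high"


# Figures quoted in the text, recomputed from the reported counts and rates.
low_share = 186 / 412          # about 0.451 of the external cohort
high_share = 112 / 412         # about 0.272 of the external cohort
progression_ratio = 33.3 / 7.1  # about 4.7-fold, high-risk versus low-risk tier
```

Both cutoffs are inclusive on the upper tier, so a score of exactly 0.25 is "moderate" and exactly 0.60 is "high".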
#### 4.5.2 Comparison with clinical scores.
Table [6](https://arxiv.org/html/2606.09860#S4.T6) compares LiverRisk against three widely used clinical scoring systems on the external cohort. The Fatty Liver Index (FLI) \[[5](https://arxiv.org/html/2606.09860#bib.bib7)\] achieves an AUROC of 0.823, the Hepatic Steatosis Index (HSI) \[[21](https://arxiv.org/html/2606.09860#bib.bib8)\] reaches 0.811, and the NAFLD Fibrosis Score (NFS) \[[2](https://arxiv.org/html/2606.09860#bib.bib33)\] attains 0.764. All three scores are substantially outperformed by LiverRisk (AUROC 0.891), and the gap is even larger for the AUPRC, reflecting the ability of gradient boosting to leverage the high-dimensional feature space and nonlinear interactions that fixed-formula scores cannot capture.
Table 6: Comparison of LiverRisk with clinical scoring systems on the external validation cohort.
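To illustrate what "fixed-formula" means here, the sketch below implements two of the comparator scores as they are commonly stated in the literature: the FLI as a logistic function of log-triglycerides, BMI, log-GGT, and waist circumference, and the HSI as a weighted ALT/AST ratio plus BMI with sex and diabetes adjustments. The coefficients are quoted from memory of the original publications and should be checked against the cited sources before any use; units are assumed to be mg/dL for triglycerides, U/L for enzymes, and cm for waist.

```python
import math


def fatty_liver_index(triglycerides_mg_dl, bmi, ggt_u_l, waist_cm):
    """FLI on a 0-100 scale (coefficients as commonly published; verify before use)."""
    z = (0.953 * math.log(triglycerides_mg_dl) + 0.139 * bmi
         + 0.718 * math.log(ggt_u_l) + 0.053 * waist_cm - 15.745)
    return 100.0 / (1.0 + math.exp(-z))


def hepatic_steatosis_index(alt_u_l, ast_u_l, bmi, female, diabetes):
    """HSI = 8 * ALT/AST + BMI, plus 2 if female and plus 2 if diabetic."""
    return 8.0 * alt_u_l / ast_u_l + bmi + (2.0 if female else 0.0) + (2.0 if diabetes else 0.0)
```

Each score uses at most four or five inputs combined additively, which is the structural limitation the paragraph above contrasts with gradient boosting over a larger feature set.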
## 5 Discussion
The experiments in this paper indicate that coupling gradient-boosted decision trees with conformal prediction can yield a NAFLD risk model that is accurate, well calibrated, and equipped with distribution-free coverage guarantees. Several aspects of these findings merit elaboration.
The observation that LiverRisk outperforms deep learning architectures (DNN and TabNet) on this moderately sized tabular dataset is consistent with a growing empirical consensus \[[16](https://arxiv.org/html/2606.09860#bib.bib35),[40](https://arxiv.org/html/2606.09860#bib.bib36)\] that tree-based ensembles remain highly competitive for structured data, particularly when the sample size is in the low thousands and the feature space is rich in heterogeneous variable types. DNNs and TabNet require careful regularization and data augmentation to avoid overfitting in this regime, and their internal attention or hidden-layer representations, while flexible, do not exploit the axis-aligned decision boundaries that tree models naturally produce for features like BMI thresholds or enzyme cutoffs.
The conformal prediction layer adds a dimension of clinical utility that raw point predictions lack. The prediction sets produced by LiverRisk are small (average size 1.12 on the combined test data), indicating that the base model is usually confident, in the sense that it rarely assigns comparable probabilities to both classes. When ambiguity does arise, as reflected by a prediction set of size two, it tends to occur among patients in the metabolic borderline zone (BMI 24–28, borderline triglycerides, mildly elevated ALT), precisely the subpopulation where clinical decision-making benefits most from an explicit uncertainty flag. The conformalized risk score in Eq. ([9](https://arxiv.org/html/2606.09860#S3.E9)) incorporates this uncertainty into the stratification, routing ambiguous patients toward further workup rather than false reassurance.
The mutual\-information stability selection procedure identifies a 14\-feature subset that aligns closely with established NAFLD risk factors\. The dominance of waist circumference over BMI in both selection probability and split importance is noteworthy: while BMI is the more commonly measured quantity in clinical practice, waist circumference is a more direct proxy for visceral adiposity, which drives hepatic lipogenesis through portal free fatty acid flux\[[45](https://arxiv.org/html/2606.09860#bib.bib32)\]\. The selection of HOMA\-IR and HbA1c alongside fasting glucose highlights the model’s sensitivity to insulin resistance pathways that precede overt diabetes\. These biological consistencies lend confidence that the model has captured genuine pathophysiological signals rather than dataset\-specific artifacts\.
Several limitations should be acknowledged. First, the NAFLD diagnosis in our cohort is based on ultrasonography, which has limited sensitivity for mild steatosis and cannot distinguish simple steatosis from NASH without biopsy. Misclassification of mild cases as controls would attenuate the observed AUROC, meaning our reported performance may underestimate the model's true discriminative ability for clinically significant NAFLD. Second, the exchangeability assumption underlying the conformal guarantee is an approximation: systematic differences in laboratory calibration or patient demographics between centers violate strict exchangeability. The empirical coverage results suggest that these violations are mild in our setting, but they may become more pronounced in populations with substantially different ethnic composition or healthcare access patterns. Third, the 12-month follow-up data used for the progression analysis are available for only a subset of the external cohort ($n=287$ of 412), introducing potential selection bias. A prospective longitudinal study with protocolized follow-up would provide more definitive evidence for the prognostic value of the risk tiers.
Future work could extend LiverRisk in several directions. Group-conditional conformal prediction \[[4](https://arxiv.org/html/2606.09860#bib.bib20)\] would allow subgroup-specific coverage control, which is particularly relevant for ensuring equity across sex and age strata. Conformalized survival analysis \[[7](https://arxiv.org/html/2606.09860#bib.bib22)\] could extend the binary classification framework to time-to-event prediction of NAFLD progression. On the feature selection front, conditional mutual information \[[14](https://arxiv.org/html/2606.09860#bib.bib31)\] could be integrated into the stability selection loop to account for redundancy among selected features, potentially yielding an even more parsimonious set. Finally, integration with electronic health record systems and deployment as a point-of-care decision support tool would require addressing operational considerations such as real-time data ingestion, model updating under distributional drift, and clinician-facing interface design.
## 6 Conclusion
We have presented LiverRisk, a framework for NAFLD risk prediction that integrates LightGBM-based gradient boosting, split conformal prediction with provable marginal coverage, and mutual-information stability selection with finite-sample error control on spurious feature inclusion. Evaluated on a multicenter Guangzhou health examination cohort, LiverRisk achieves an AUROC of 0.912 internally and 0.891 on an external validation set, outperforming deep neural networks, attention-based tabular models, and classical machine-learning baselines. The conformal prediction sets attain empirical coverage of 91.3% at the nominal 90% level, and the derived three-tier risk stratification identifies a high-risk subgroup whose 12-month progression rate is 4.7 times that of the low-risk tier. The 14-feature subset selected by stability selection is dominated by waist circumference, liver enzymes, and metabolic markers, all variables with well-established roles in NAFLD pathophysiology, which supports the biological plausibility of the model. These results suggest that conformally calibrated gradient boosting offers a practical path toward rigorous, uncertainty-aware clinical risk prediction for NAFLD and potentially for other metabolic diseases where population-level screening from routine health data is both feasible and clinically impactful.
## References
- \[1\]\(2021\)A gentle introduction to conformal prediction and distribution\-free uncertainty quantification\.arXiv preprint arXiv:2107\.07511\.Cited by:[§2\.2](https://arxiv.org/html/2606.09860#S2.SS2.p1.1)\.
- \[2\]P\. Angulo, J\. M\. Hui, G\. Marchesini, E\. Bugianesi, J\. George, G\. C\. Farrell, F\. Enders, S\. Saksena, A\. D\. Burt, J\. P\. Bida,et al\.\(2007\)The nafld fibrosis score: a noninvasive system that identifies liver fibrosis in patients with nafld\.Hepatology45\(4\),pp\. 846–854\.Cited by:[§4\.5\.2](https://arxiv.org/html/2606.09860#S4.SS5.SSS2.p1.1)\.
- \[3\]S\. O\. Arik and T\. Pfister\(2021\)TabNet: attentive interpretable tabular learning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.35,pp\. 6679–6687\.Cited by:[§2\.1](https://arxiv.org/html/2606.09860#S2.SS1.p1.1)\.
- \[4\]R\. F\. Barber, E\. J\. Candès, A\. Ramdas, and R\. J\. Tibshirani\(2023\)Conformal prediction beyond exchangeability\.Annals of Statistics51\(2\),pp\. 816–845\.Cited by:[§2\.2](https://arxiv.org/html/2606.09860#S2.SS2.p1.1),[§5](https://arxiv.org/html/2606.09860#S5.p6.1)\.
- \[5\]G\. Bedogni, S\. Bellentani, L\. Miglioli, F\. Masutti, M\. Passalacqua, A\. Castiglione, and C\. Tiribelli\(2006\)The fatty liver index: a simple and accurate predictor of hepatic steatosis in the general population\.BMC Gastroenterology6\(1\),pp\. 1–7\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p2.1),[§4\.5\.2](https://arxiv.org/html/2606.09860#S4.SS5.SSS2.p1.1)\.
- \[6\]L\. Breiman\(2001\)Random forests\.Machine Learning45\(1\),pp\. 5–32\.Cited by:[§2\.3](https://arxiv.org/html/2606.09860#S2.SS3.p1.6)\.
- \[7\]E\. J\. Candès, L\. Lei, and Z\. Ren\(2023\)Conformalized survival analysis\.Journal of the Royal Statistical Society: Series B85\(1\),pp\. 24–45\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.09860#S2.SS2.p1.1),[§5](https://arxiv.org/html/2606.09860#S5.p6.1)\.
- \[8\]N\. Chalasani, Z\. Younossi, J\. E\. Lavine, M\. Charlton, K\. Cusi, M\. Rinella, S\. A\. Harrison, E\. M\. Brunt, and A\. J\. Sanyal\(2018\)The diagnosis and management of nonalcoholic fatty liver disease: practice guidance from the american association for the study of liver diseases\.Hepatology67\(1\),pp\. 328–357\.Cited by:[§4\.1](https://arxiv.org/html/2606.09860#S4.SS1.p1.3)\.
- \[9\]T\. Chen and C\. Guestrin\(2016\)XGBoost: a scalable tree boosting system\.InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,pp\. 785–794\.Cited by:[§2\.1](https://arxiv.org/html/2606.09860#S2.SS1.p1.1),[§3\.2](https://arxiv.org/html/2606.09860#S3.SS2.p1.5)\.
- \[10\]Z\. Chen, Y\. Hu, Z\. Li, Z\. Fu, X\. Song, and L\. Nie\(2025\)OFFSET: segmentation\-based focus shift revision for composed image retrieval\.InProceedings of the ACM International Conference on Multimedia \(ACM MM\),pp\. 6113–6122\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.
- \[11\]Z\. Chen, Y\. Hu, Z\. Li, Z\. Fu, H\. Wen, and W\. Guan\(2025\)HUD: hierarchical uncertainty\-aware disambiguation network for composed video retrieval\.InProceedings of the ACM International Conference on Multimedia \(ACM MM\),pp\. 6143–6152\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.
- \[12\]Z\. Chen, Y\. Hu, Z\. Fu, Z\. Li, J\. Huang, Q\. Huang, and Y\. Wei\(2026\)INTENT: invariance and discrimination\-aware noise mitigation for robust composed image retrieval\.InProceedings of the AAAI Conference on Artificial Intelligence,Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.
- \[13\]T\. M\. Cover and J\. A\. Thomas\(2006\)Elements of information theory\.2nd edition,John Wiley & Sons\.Cited by:[§2\.3](https://arxiv.org/html/2606.09860#S2.SS3.p1.6)\.
- \[14\]F\. Fleuret\(2004\)Fast binary feature selection with conditional mutual information\.Journal of Machine Learning Research5,pp\. 1531–1555\.Cited by:[§5](https://arxiv.org/html/2606.09860#S5.p6.1)\.
- \[15\]Z\. Fu, Y\. Hu, Q\. Yang, S\. Zhang, Z\. Chen, and Z\. Li\(2026\)Air\-know: arbiter\-calibrated knowledge\-internalizing robust network for composed image retrieval\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p2.1)\.
- \[16\]L\. Grinsztajn, E\. Oyallon, and G\. Varoquaux\(2022\)Why do tree\-based models still outperform deep learning on typical tabular data?\.Advances in Neural Information Processing Systems35,pp\. 507–520\.Cited by:[§5](https://arxiv.org/html/2606.09860#S5.p2.1)\.
- \[17\]I\. Guyon, J\. Weston, S\. Barnhill, and V\. Vapnik\(2002\)Gene selection for cancer classification using support vector machines\.Machine Learning46\(1\),pp\. 389–422\.Cited by:[§2\.3](https://arxiv.org/html/2606.09860#S2.SS3.p1.6)\.
- \[18\]Y\. Hu, Z\. Li, Z\. Chen, Q\. Huang, Z\. Fu, M\. Xu, and L\. Nie\(2026\)REFINE: composed video retrieval via shared and differential semantics enhancement\.ACM Transactions on Multimedia Computing, Communications and Applications\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.
- \[19\]G\. Ke, Q\. Meng, T\. Finley, T\. Wang, W\. Chen, W\. Ma, Q\. Ye, and T\. Y\. Liu\(2017\)LightGBM: a highly efficient gradient boosting decision tree\.InAdvances in Neural Information Processing Systems,Vol\.30,pp\. 3146–3154\.Cited by:[§2\.1](https://arxiv.org/html/2606.09860#S2.SS1.p1.1),[§3\.2](https://arxiv.org/html/2606.09860#S3.SS2.p1.1)\.
- \[20\]A\. Kraskov, H\. Stögbauer, and P\. Grassberger\(2004\)Estimating mutual information\.Physical Review E69\(6\),pp\. 066138\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p4.1),[§2\.3](https://arxiv.org/html/2606.09860#S2.SS3.p1.6),[§3\.4](https://arxiv.org/html/2606.09860#S3.SS4.p1.7)\.
- \[21\]J\. H\. Lee, D\. Kim, H\. J\. Kim, C\. H\. Lee, J\. I\. Yang, W\. Kim, Y\. J\. Kim, J\. H\. Yoon, S\. H\. Cho, M\. W\. Sung,et al\.\(2010\)Hepatic steatosis index: a simple screening tool reflecting nonalcoholic fatty liver disease\.Digestive and Liver Disease42\(7\),pp\. 503–508\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p2.1),[§4\.5\.2](https://arxiv.org/html/2606.09860#S4.SS5.SSS2.p1.1)\.
- \[22\]B\. Li, H\. Dong, D\. Zhang, Z\. Zhao, J\. Gao, and X\. Li\(2025\)Exploring efficient open\-vocabulary segmentation in the remote sensing\.arXiv preprint arXiv:2509\.12040\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.
- \[23\]B\. Li, T\. Huo, D\. Zhang, Z\. Zhao, J\. Gao, and X\. Li\(2025\)Exploring the underwater world segmentation without extra training\.arXiv preprint arXiv:2511\.07923\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.
- \[24\]B\. Li, F\. Wang, D\. Zhang, Z\. Zhao, J\. Gao, and X\. Li\(2025\)MARIS: marine open\-vocabulary instance segmentation with geometric enhancement and semantic alignment\.arXiv preprint arXiv:2510\.15398\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.
- \[25\]B\. Li, D\. Zhang, Z\. Zhao, J\. Gao, and X\. Li\(2025\)StitchFusion: weaving any visual modalities to enhance multimodal semantic segmentation\.InProceedings of the ACM International Conference on Multimedia,pp\. 1308–1317\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.
- \[26\]B\. Li, D\. Zhang, Z\. Zhao, J\. Gao, and X\. Li\(2025\)U3M: unbiased multiscale modal fusion model for multimodal semantic segmentation\.Pattern Recognition168,pp\. 111801\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p2.1)\.
- \[27\]Z\. Li, Z\. Chen, H\. Wen, Z\. Fu, Y\. Hu, and W\. Guan\(2025\)Encoder: entity mining and modification relation binding for composed image retrieval\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 5101–5109\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.
- \[28\]Z\. Li, Y\. Hu, Z\. Chen, Q\. Huang, G\. Qiu, Z\. Fu, and M\. Liu\(2026\)ReTrack: evidence\-driven dual\-stream directional anchor calibration network for composed video retrieval\.InProceedings of the AAAI Conference on Artificial Intelligence,Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.
- \[29\]Z\. Li, Y\. Hu, Z\. Chen, M\. Zhang, Z\. Fu, and L\. Nie\(2026\)ConeSep: cone\-based robust noise\-unlearning compositional network for composed image retrieval\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p2.1)\.
- \[30\]Z\. Li, Y\. Hu, Z\. Chen, S\. Zhang, Q\. Huang, Z\. Fu, and Y\. Wei\(2026\)HABIT: chrono\-synergia robust progressive learning framework for composed image retrieval\.InProceedings of the AAAI Conference on Artificial Intelligence,Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.
- \[31\]C\. Lu, A\. Lemay, K\. C\. Chang, C\. Höbel, and P\. Golland\(2022\)Fair conformal predictors for applications in medical imaging\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.36,pp\. 12008–12016\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.09860#S2.SS2.p1.1)\.
- \[32\]H\. Ma, C\. Xu, Z\. Shen, C\. Yu, and Y\. Li\(2021\)Application of machine learning techniques for clinical predictive modeling: a cross\-sectional study on nonalcoholic fatty liver disease in china\.BioMed Research International,pp\. 1–8\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.09860#S2.SS1.p1.1)\.
- \[33\]M\. Maximos, F\. Bril, P\. P\. Sanchez, R\. Lomonaco, B\. Orsak, D\. Biernacki, A\. Suman, M\. Weber, and K\. Cusi\(2015\)The role of liver fat and insulin resistance as determinants of plasma aminotransferase elevation in nonalcoholic fatty liver disease\.Hepatology61\(1\),pp\. 153–160\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p2.1)\.
- \[34\]N\. Meinshausen and P\. Bühlmann\(2010\)Stability selection\.Journal of the Royal Statistical Society: Series B72\(4\),pp\. 417–473\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p4.1),[§2\.3](https://arxiv.org/html/2606.09860#S2.SS3.p1.6),[§3\.4](https://arxiv.org/html/2606.09860#S3.SS4.p2.9)\.
- \[35\]H\. Papadopoulos, K\. Proedrou, V\. Vovk, and A\. Gammerman\(2002\)Inductive confidence machines for regression\.InEuropean Conference on Machine Learning,pp\. 345–356\.Cited by:[§2\.2](https://arxiv.org/html/2606.09860#S2.SS2.p1.1),[§3\.3](https://arxiv.org/html/2606.09860#S3.SS3.p1.5)\.
- \[36\]J\. Platt\(1999\)Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods\.Advances in Large Margin Classifiers,pp\. 61–74\.Cited by:[§4\.2](https://arxiv.org/html/2606.09860#S4.SS2.p2.1)\.
- \[37\]A\. Sarkar, M\. Y\. I\. Idris, and Z\. Yu\(2025\)Reasoning in computer vision: taxonomy, models, tasks, and methodologies\.arXiv preprint arXiv:2508\.10523\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p2.1)\.
- \[38\]G\. Shafer and V\. Vovk\(2008\)A tutorial on conformal prediction\.Journal of Machine Learning Research9,pp\. 371–421\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p3.1)\.
- \[39\]R\. D\. Shah and R\. J\. Samworth\(2013\)Variable selection with error control: another look at stability selection\.Journal of the Royal Statistical Society: Series B75\(1\),pp\. 55–80\.Cited by:[§2\.3](https://arxiv.org/html/2606.09860#S2.SS3.p1.6)\.
- \[40\]R\. Shwartz\-Ziv and A\. Armon\(2022\)Tabular data: deep learning is not all you need\.Information Fusion81,pp\. 84–90\.Cited by:[§5](https://arxiv.org/html/2606.09860#S5.p2.1)\.
- \[41\]K\. Song, Y\. Zhu, and Q\. Liu\(2022\)Deep learning methods for hepatological disease prediction from electronic health records\.Computer Methods and Programs in Biomedicine215,pp\. 106608\.Cited by:[§2\.1](https://arxiv.org/html/2606.09860#S2.SS1.p1.1)\.
- \[42\]J\. P\. Sowa, S\. Atmaca, R\. K\. Gieseler, and A\. Canbay\(2021\)A deep learning approach for detection of non\-alcoholic fatty liver disease\.Journal of Hepatology74,pp\. S165–S166\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.09860#S2.SS1.p1.1)\.
- \[43\]V\. Vovk, A\. Gammerman, and G\. Shafer\(2005\)Algorithmic learning in a random world\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.09860#S2.SS2.p1.1),[§3\.3](https://arxiv.org/html/2606.09860#S3.SS3.p1.5)\.
- \[44\]M\. Xia, H\. Bian, and X\. Gao\(2021\)NAFLD\-related risk prediction models using machine learning: a systematic review\.Metabolism106,pp\. 154243\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.09860#S2.SS1.p1.1)\.
- \[45\]S\. J\. Yu, W\. Kim, D\. Kim, H\. S\. Yoon, and J\. Lee\(2015\)Visceral obesity predicts significant fibrosis in patients with nonalcoholic fatty liver disease\.Medicine94\(48\),pp\. e2159\.Cited by:[§4\.3\.1](https://arxiv.org/html/2606.09860#S4.SS3.SSS1.p1.1),[§5](https://arxiv.org/html/2606.09860#S5.p4.1)\.
- \[46\]Z\. Yu, M\. Y\. I\. Idris, H\. Wang, P\. Wang, J\. Chen, and K\. Wang\(2025\)From physics to foundation models: a review of ai\-driven quantitative remote sensing inversion\.arXiv preprint arXiv:2507\.09081\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.
- \[47\]Z\. Yu, M\. Y\. I\. Idris, P\. Wang, and R\. Qureshi\(2026\)DINOv3\-powered multi\-task foundation model for quantitative remote sensing estimation \(student abstract\)\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 41455–41456\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.
- \[48\]Z\. Yu, H\. Jiang, P\. Wang, Z\. Lin, and Y\. Xiang\(2026\)Spatiotemporal alignment for remote sensing image recovery via terrain\-aware diffusion\.InICASSP 2026\-2026 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),pp\. 11257–11261\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p2.1)\.
- \[49\]Z\. Yu, J\. Wang, H\. Chen, and M\. Y\. I\. Idris\(2025\)Qrs\-trs: style transfer\-based image\-to\-image translation for carbon stock estimation in quantitative remote sensing\.IEEE Access\.Cited by:[§1](https://arxiv.org/html/2606.09860#S1.p1.1)\.