Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study

arXiv cs.LG Papers

Summary

This study evaluates five machine learning classifiers for chronic kidney disease risk prediction, finding that near-perfect internal performance fails under distribution shift. It emphasizes the need for calibration stability and conformal coverage transfer before clinical deployment.

arXiv:2605.21566v1 Announce Type: new Abstract: Machine learning models for chronic kidney disease (CKD) risk prediction often post strong discrimination scores on internal test sets. Calibration and uncertainty quantification get far less attention, leaving clinicians without reliable information about whether the probability outputs are accurate. We trained five classifiers on the UCI CKD dataset (400 patients, 62.5% CKD prevalence): logistic regression, random forest, XGBoost, SVM with Platt scaling, and Gaussian naive Bayes. We evaluated each across calibration quality, conformal prediction coverage, and an eight-criterion deployment readiness framework. A distributional stress-test applied the best-calibrated variant of each model to the open-access MIMIC-IV demo cohort (97 patients, 23.7% CKD) to assess behaviour under prevalence shift and feature missingness. We measured calibration before and after Platt scaling and isotonic regression using Expected Calibration Error and Brier Score, and quantified uncertainty through split conformal prediction targeting 90% marginal coverage. All five models reached AUROC 1.00 on the UCI test set. Isotonic recalibration reduced internal ECE to 0.000-0.022. On MIMIC-IV, AUROC fell to 0.48-0.58, ECE rose to 0.68-0.76, and conformal coverage dropped from 0.80-0.98 to 0.21-0.25 against a 90% target. No model scored above 4 out of 16 on the deployment readiness checklist. Near-perfect internal performance did not transfer. Calibration stability and conformal coverage should be evaluated on external data before any clinical prediction model moves toward deployment.


# Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study
Source: [https://arxiv.org/html/2605.21566](https://arxiv.org/html/2605.21566)
###### Abstract

Background\. Machine learning models for chronic kidney disease risk prediction regularly achieve strong discrimination on internal test sets\. Calibration assessment and uncertainty quantification are far less common, leaving clinicians without reliable information about whether probability outputs are trustworthy\. No published study has jointly evaluated all three dimensions \(calibration, uncertainty, and structured deployment readiness\) on a common model suite with external clinical validation\.

Objective\. To evaluate five classifiers across calibration quality, conformal prediction coverage, and an eight\-criterion deployment readiness framework on both internal and external data\.

Methods\. Five classifiers \(logistic regression, random forest, XGBoost, support vector machine with Platt scaling, Gaussian naive Bayes\) were trained on the UCI CKD dataset \(400 patients, 62\.5% CKD\)\. A distributional stress\-test used the open\-access MIMIC\-IV demo cohort \(97 patients, 23\.7% CKD\) to evaluate model behaviour under prevalence shift and feature missingness\. Calibration was assessed before and after Platt scaling and isotonic regression, quantified by Expected Calibration Error and Brier Score\. Predictive uncertainty was measured through split conformal prediction targeting 90% marginal coverage\. An eight\-criterion deployment readiness framework evaluated discrimination, calibration stability, coverage transfer, subgroup equity, and reproducibility\.

Results\. All five models achieved AUROC 1\.00 on the UCI test set\. Post\-isotonic ECE fell to 0\.000–0\.022 internally\. On MIMIC\-IV, AUROC dropped to 0\.48–0\.58, ECE rose to 0\.68–0\.76, and conformal coverage collapsed from 0\.80–0\.98 \(UCI\) to 0\.21–0\.25, well below the 90% target\. No model passed the deployment checklist; scores ranged from 2 to 4 out of 16\.

Conclusion\. Near\-perfect internal performance did not survive distributional shift\. Calibration stability and conformal coverage transfer should be evaluated before any clinical ML model moves toward deployment, even when internal metrics appear strong\.

Keywords: chronic kidney disease, probability calibration, conformal prediction, uncertainty quantification, deployment readiness, machine learning

## 1 Introduction

Roughly 850 million people worldwide are estimated to have chronic kidney disease, and the global prevalence grew by 33% between 1990 and 2017\[[9](https://arxiv.org/html/2605.21566#bib.bib1),[13](https://arxiv.org/html/2605.21566#bib.bib2)\]\. Among people with diabetes, as many as 1 in 3 are affected; among those with hypertension in high\-income settings, the proportion reaches approximately 1 in 5\[[9](https://arxiv.org/html/2605.21566#bib.bib1)\]\. By 2040, CKD is projected to rank fifth among leading causes of years of life lost globally\[[9](https://arxiv.org/html/2605.21566#bib.bib1)\]\. Those numbers create genuine pressure on health systems to identify high\-risk patients early, before irreversible loss of kidney function closes off the best treatment options\.

Machine learning has been proposed as a practical answer to this challenge\. Models trained on electronic health records, biomarker panels, and demographic variables have reported AUROC values above 0\.95 in national cohort studies\[[14](https://arxiv.org/html/2605.21566#bib.bib6),[2](https://arxiv.org/html/2605.21566#bib.bib7),[15](https://arxiv.org/html/2605.21566#bib.bib8),[23](https://arxiv.org/html/2605.21566#bib.bib9)\]\. Established risk equations like the Kidney Failure Risk Equation have been validated across populations in North America, the United Kingdom, and Latin America, confirming algorithmic CKD risk prediction is technically achievable\[[25](https://arxiv.org/html/2605.21566#bib.bib3),[17](https://arxiv.org/html/2605.21566#bib.bib4),[4](https://arxiv.org/html/2605.21566#bib.bib5)\]\. The field has not struggled to build models\. The struggle is with what happens after the model is built\.

Discrimination metrics like AUROC measure whether a model ranks patients correctly relative to each other\. They say nothing about whether the assigned probability scores are trustworthy in absolute terms\. A model posting AUROC 0\.97 can still assign a 65% risk score to patients whose true event rate sits near 20%, and clinicians making treatment decisions from those numbers are then working from miscalibrated information\. Van Calster and colleagues identified calibration as the Achilles heel of predictive analytics, noting poor calibration regularly persists even when discrimination appears strong \[[27](https://arxiv.org/html/2605.21566#bib.bib10)\]\. A systematic review of CKD risk models by Echouffo\-Tcheugui and Kengne found calibration is assessed less commonly than discrimination across the published literature; of all the models reviewed, only eight for CKD occurrence and five for CKD progression had been externally validated for calibration \[[8](https://arxiv.org/html/2605.21566#bib.bib11)\]\. The models exist\. The evidence for trusting their probability outputs largely does not\.

The problem runs deeper than calibration alone\. Campagner and colleagues reviewed machine learning studies in healthcare and found fewer than 4% address uncertainty quantification explicitly\[[5](https://arxiv.org/html/2605.21566#bib.bib17)\]\. A model outputting a probability without any indication of how much to trust the number puts clinicians in a difficult position\. Banerji and colleagues put this plainly: clinical AI tools must communicate predictive uncertainty at the level of the individual patient, not purely in aggregate performance statistics\[[3](https://arxiv.org/html/2605.21566#bib.bib18)\]\. A predicted CKD risk of 78% warrants a different clinical response when the model’s uncertainty is narrow than when high uncertainty renders the output practically unreliable\.

A 2023 systematic review commissioned to support CDC prevention guidelines reached a pointed conclusion: CKD risk prediction models need to be better calibrated and externally validated before incorporation into clinical guidelines\[[10](https://arxiv.org/html/2605.21566#bib.bib30)\]\. No published study in the CKD literature has operationalized all three of these demands together\. Current guidance for clinical prediction model evaluation calls for joint assessment of discrimination, calibration, fairness, and generalizability before deployment consideration\[[6](https://arxiv.org/html/2605.21566#bib.bib24)\], and the design of external validation studies requires attention to population comparability and feature completeness\[[20](https://arxiv.org/html/2605.21566#bib.bib25)\]\. No existing work has jointly evaluated calibration across multiple post\-hoc correction methods, quantified uncertainty through a coverage\-guaranteed conformal framework, and assessed deployment readiness through a structured multi\-criterion checklist, all on the same model suite and across an independent external cohort\.

This study addresses the gap\. Using the UCI CKD dataset for model development and MIMIC\-IV as an external validation cohort, we trained five classifiers spanning the range commonly used in clinical prediction: logistic regression, random forest, gradient boosting via XGBoost, a support vector machine with Platt\-scaled probabilities, and Gaussian naive Bayes\. Each model was evaluated across three dimensions: calibration before and after post\-hoc recalibration using Platt scaling and isotonic regression; predictive uncertainty through split conformal prediction with a formal 90% marginal coverage guarantee; and a structured eight\-criterion deployment readiness framework grounded in current reporting standards including TRIPOD\+AI\[[7](https://arxiv.org/html/2605.21566#bib.bib23)\]\.

The study has three objectives:

1. Quantify pre\- and post\-calibration error for five CKD classifiers on both the internal UCI test set and the external MIMIC\-IV cohort\.
2. Apply split conformal prediction to generate prediction sets with a 90% coverage guarantee and determine whether the guarantee holds on external data\.
3. Score each model against an eight\-criterion deployment readiness checklist and identify which, if any, meet the threshold for responsible clinical use\.

## 2 Methods

### 2\.1 Datasets

Two datasets were used\. The UCI CKD dataset served as the primary training and internal validation source\[[22](https://arxiv.org/html/2605.21566#bib.bib28)\]\. It contains 400 patient records collected from a hospital in Vellore, India, each described by 24 clinical and laboratory features alongside a binary CKD label\. Of the 400 patients, 250 \(62\.5%\) carry a positive CKD diagnosis\. Mean patient age was 51\.6 years \(SD 17\.0\)\. Features include continuous measurements such as serum creatinine, blood urea, hemoglobin, sodium, potassium, and packed cell volume, plus categorical variables for comorbidities \(hypertension, diabetes, coronary artery disease\) and urinary findings \(red blood cell morphology, pus cells, bacteria\)\.

The MIMIC\-IV Clinical Database Demo \(version 2\.2\) provided a distributional stress\-test cohort\[[12](https://arxiv.org/html/2605.21566#bib.bib29)\]\. This is a publicly available, open\-access subset of MIMIC\-IV released by PhysioNet specifically for pipeline development and workshop use; it contains 100 de\-identified patients from Beth Israel Deaconess Medical Center in Boston and requires no credentialing\. It is not a formally designed external validation set, and its use here is deliberately framed as a stress\-test: the goal is to evaluate how each model behaves when applied to a population with different prevalence, missing features, and clinical context than the training data\.

Patients were included if serum creatinine was available on their first hospital admission, yielding 97 patients\. CKD labeling used the CKD\-EPI 2021 race\-free equation: eGFR below 60 mL/min/1\.73 m² defined CKD\-positive status\. That threshold produced 23 CKD cases \(23\.7%\) and 74 controls\. Mean age in the demo cohort was 61\.7 years \(SD 16\.3\)\. The dataset is fully de\-identified; no human subjects review was required\.
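The labelling rule above can be sketched as follows. This is a minimal illustration, assuming the published CKD-EPI 2021 creatinine coefficients and serum creatinine in mg/dL; the function names `egfr_ckd_epi_2021` and `label_ckd` are hypothetical and are not taken from the study's code.

```python
def egfr_ckd_epi_2021(scr_mg_dl: float, age_years: float, is_female: bool) -> float:
    """Estimated GFR (mL/min/1.73 m^2) from the CKD-EPI 2021 race-free creatinine equation."""
    kappa = 0.7 if is_female else 0.9          # sex-specific creatinine scaling
    alpha = -0.241 if is_female else -0.302    # exponent applied below kappa
    ratio = scr_mg_dl / kappa
    egfr = (
        142.0
        * (min(ratio, 1.0) ** alpha)
        * (max(ratio, 1.0) ** (-1.200))
        * (0.9938 ** age_years)
    )
    if is_female:
        egfr *= 1.012
    return egfr


def label_ckd(scr_mg_dl: float, age_years: float, is_female: bool,
              threshold: float = 60.0) -> int:
    """Return 1 (CKD-positive) when eGFR falls below the threshold, else 0."""
    return int(egfr_ckd_epi_2021(scr_mg_dl, age_years, is_female) < threshold)
```

For a 60-year-old man, a creatinine of 0.9 mg/dL gives an eGFR near 98 and a negative label, while 2.0 mg/dL gives an eGFR below 40 and a positive label.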

### 2\.2 Preprocessing and Feature Harmonization

UCI preprocessing started with whitespace stripping across all categorical columns\. Categorical features were mapped to binary 0/1 values\. The target label was encoded as CKD = 1, notCKD = 0\. Missing values in continuous features were replaced with the column median; categorical missing values were replaced with the column mode\. Zero missing values remained after imputation across all 400 records\.

MIMIC harmonization used statistics from the UCI training fold only, to prevent any leakage from the external data into the model pipeline\. Seven features present in the UCI schema are not routinely recorded in MIMIC: urine\-specific gravity, urine sugar, pus cells \(categorical\), pus cell clumps, bacteria, appetite, and pedal edema\. Each was filled with the UCI training\-set median or mode as appropriate\. Blood pressure came from ICU chartevents \(item IDs 220179 and 220050, systolic range 60–250 mmHg\); where that source was missing, the MIMIC outpatient medical record table provided a fallback\. That two\-source approach resolved blood pressure for 94 of 97 patients\. Laboratory values including serum creatinine, blood urea nitrogen, hemoglobin, albumin, potassium, sodium, glucose, hematocrit, WBC, and RBC were extracted from the MIMIC lab events table and averaged across each patient’s first admission\. Comorbidity flags for hypertension, diabetes, and coronary artery disease came from ICD\-10 codes \(I10, E11\.x, I25\.x\)\. Anemia was defined as hemoglobin below 12 g/dL in females and below 13\.5 g/dL in males\.
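The leakage-safe fill described above can be sketched with pandas. This is an illustration under stated assumptions, not the study's pipeline: the helper names and the column names (`sc`, `sg`, `appet`) are hypothetical, and the toy frames only stand in for the UCI training fold and the MIMIC extract.

```python
import pandas as pd


def fit_imputation_values(train_df, continuous_cols, categorical_cols):
    """Learn fill values from the training fold only: medians and modes."""
    values = {col: train_df[col].median() for col in continuous_cols}
    values.update({col: train_df[col].mode().iloc[0] for col in categorical_cols})
    return values


def harmonize(external_df, fill_values):
    """Align an external cohort to the training schema using training-fold statistics."""
    out = external_df.copy()
    for col, value in fill_values.items():
        if col not in out.columns:
            out[col] = value                   # feature absent from the external schema
        else:
            out[col] = out[col].fillna(value)  # feature present but partly missing
    return out[list(fill_values)]              # enforce training column order


# Toy stand-ins: creatinine is measured externally, the other two features are absent.
train = pd.DataFrame({"sc": [1.0, 2.0, 3.0], "sg": [1.01, 1.02, 1.02], "appet": [1, 1, 0]})
external = pd.DataFrame({"sc": [1.5, None]})
fill = fit_imputation_values(train, ["sc", "sg"], ["appet"])
harmonized = harmonize(external, fill)
```

Every external row receives the same value for an absent feature, which is why such columns carry no patient-specific information.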

### 2\.3 Model Suite and Training

Records were split into training \(70%, n = 279\), validation \(15%, n = 60\), and test \(15%, n = 61\) subsets using stratified random sampling \(random\_state = 42\)\. The MIMIC demo cohort was held back entirely as a stress\-test set, with no records used during any training, validation, or calibration step\.
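One way to obtain a 279/60/61 stratified split is two successive calls to scikit-learn's `train_test_split` with integer fold sizes. This is a sketch under that assumption; the text does not say how the authors produced the exact counts, and `split_70_15_15` and the placeholder arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_70_15_15(X, y, n_val=60, n_test=61, seed=42):
    """Stratified train/validation/test split with fixed validation and test sizes."""
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, stratify=y, random_state=seed
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Placeholder data with the UCI class balance: 250 CKD, 150 not-CKD.
X_demo = np.arange(400).reshape(-1, 1)
y_demo = np.array([1] * 250 + [0] * 150)
(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_70_15_15(X_demo, y_demo)
```

Stratifying both calls keeps each fold's CKD prevalence close to the overall 62.5%.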

Five classifier families were selected to span the calibration behaviors seen in clinical ML: logistic regression \(L2 regularization\), random forest \(ensemble, known for overconfident probabilities\), XGBoost \(gradient boosting, strong discrimination but typically miscalibrated\), support vector machine \(`probability=True`\), and Gaussian naive Bayes \[[18](https://arxiv.org/html/2605.21566#bib.bib13)\]\. Hyperparameter tuning used five\-fold stratified cross\-validation on the training fold, optimizing AUROC\. The logistic regression penalty parameter C was tuned over \{0\.001, 0\.01, 0\.1, 1, 10, 100\}\. RF, XGB, and SVM used randomized search with up to 30 iterations over defined grids \(full grids in Supplementary S1\)\. Fitted models were saved with joblib\. Software: Python 3\.13, scikit\-learn, XGBoost, MAPIE 1\.3, netcal, pandas, numpy, matplotlib, joblib; full version pins in requirements\.txt\.
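A minimal sketch of the logistic regression search, assuming scikit-learn's `GridSearchCV` with shuffled stratified folds. The synthetic data from `make_classification` only stands in for the 279-record training fold, and details such as feature scaling or solver choice are not given in the text, so they are left at library defaults here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

C_GRID = [0.001, 0.01, 0.1, 1, 10, 100]  # grid stated in the text


def tune_logistic_regression(X, y, seed=42):
    """Five-fold stratified grid search over C, scored by AUROC."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(
        LogisticRegression(max_iter=5000),  # default penalty is L2
        param_grid={"C": C_GRID},
        scoring="roc_auc",
        cv=cv,
    )
    return search.fit(X, y)


# Synthetic placeholder with the training-fold shape (279 rows, 24 features).
X_demo, y_demo = make_classification(n_samples=279, n_features=24, random_state=42)
search = tune_logistic_regression(X_demo, y_demo)
```

After fitting, `search.best_params_` holds the selected C and `search.best_score_` the mean cross-validated AUROC.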

### 2\.4 Calibration Evaluation

Pre\-calibration metrics were computed on the UCI validation set for each model: Expected Calibration Error \(ECE, 10 equal\-width bins\) \[[11](https://arxiv.org/html/2605.21566#bib.bib12)\], Maximum Calibration Error \(MCE\), Brier Score, and Brier Skill Score relative to a naive prevalence baseline\. ECE and MCE used the netcal library\. Reliability diagrams used `CalibrationDisplay` from scikit\-learn\.
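The four metrics can be written out directly in NumPy. This is a plain re-implementation for illustration, assuming binary labels and predicted probabilities of the positive class; it is not the netcal code the study used, and binning details may differ slightly from that library.

```python
import numpy as np


def calibration_errors(y_true, p_pred, n_bins=10):
    """Return (ECE, MCE) using equal-width probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(p_pred, edges[1:-1])   # interior edges, so p == 1.0 lands in the last bin
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(y_true[mask].mean() - p_pred[mask].mean())  # observed rate vs mean confidence
        ece += mask.mean() * gap                               # weight by share of samples in bin
        mce = max(mce, gap)
    return ece, mce


def brier_scores(y_true, p_pred):
    """Return (Brier Score, Brier Skill Score against a constant-prevalence forecast)."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    brier = np.mean((p_pred - y_true) ** 2)
    reference = np.mean((y_true.mean() - y_true) ** 2)
    return brier, 1.0 - brier / reference


ece, mce = calibration_errors([0, 0, 1, 1], [0.15, 0.15, 0.85, 0.85])
brier, bss = brier_scores([0, 0, 1, 1], [0.15, 0.15, 0.85, 0.85])
```

In this toy case every prediction is off by 0.15, so ECE and MCE are both 0.15, the Brier Score is 0.0225, and the skill score is 0.91.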

Two post\-hoc recalibration methods were fitted on the validation set\. Platt scaling fits a logistic regression layer on the base model’s raw scores \[[19](https://arxiv.org/html/2605.21566#bib.bib14)\] \(`CalibratedClassifierCV`, `method='sigmoid'`, `cv='prefit'`\)\. Isotonic regression fits a piecewise\-constant monotone function \[[30](https://arxiv.org/html/2605.21566#bib.bib15)\] \(`method='isotonic'`\)\. `FrozenEstimator` prevented any refitting of the base model in both cases\. The test set was not used at any point during calibration fitting\.
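The two recalibration maps can also be fitted by hand, which makes the difference between them easy to see. The sketch below swaps the `CalibratedClassifierCV` wrapper for direct use of `LogisticRegression` and `IsotonicRegression`, so it illustrates the technique rather than reproducing the study's calls; `fit_platt`, `fit_isotonic`, and the simulated overconfident scores are hypothetical.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression


def fit_platt(scores_val, y_val):
    """Platt-style map: a one-feature logistic regression on the base model's scores."""
    lr = LogisticRegression(C=1e6, max_iter=1000)  # large C, so close to an unpenalized fit
    lr.fit(np.asarray(scores_val).reshape(-1, 1), y_val)
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]


def fit_isotonic(scores_val, y_val):
    """Isotonic map: a non-decreasing step function from scores to probabilities."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(np.asarray(scores_val), y_val)
    return iso.predict


# Simulated validation fold (n = 60) whose scores overstate the true event rate.
rng = np.random.default_rng(0)
scores_val = rng.uniform(0.0, 1.0, 60)
y_val = (rng.uniform(0.0, 1.0, 60) < scores_val ** 2).astype(int)

platt = fit_platt(scores_val, y_val)
isotonic = fit_isotonic(scores_val, y_val)
grid = np.linspace(0.0, 1.0, 11)
platt_out, iso_out = platt(grid), isotonic(grid)
```

The base model is never refitted in either function; only the one-dimensional map from score to probability is learned on the validation fold.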

Post\-calibration metrics were computed on the UCI test set for the base, Platt\-scaled, and isotonic\-scaled variants\. For external validation, the best variant per model \(lowest ECE on the UCI test set\) was applied to MIMIC\. Calibration drift was MIMIC ECE minus UCI ECE for that variant\.

### 2\.5 Uncertainty Quantification

Uncertainty was quantified through split conformal prediction using the MAPIE library \(`SplitConformalClassifier`, version 1\.3\) \[[26](https://arxiv.org/html/2605.21566#bib.bib20),[1](https://arxiv.org/html/2605.21566#bib.bib19)\]\. Each base model’s conformal predictor was fitted on the UCI validation set \(n = 60\) using the least ambiguous class \(LAC\) conformity score: one minus the predicted probability of the true class\. Target confidence was 0\.90 \(α = 0\.10\), so prediction sets should contain the true label for at least 90% of test cases\.

Three metrics were computed on both the UCI test set and the MIMIC cohort: empirical coverage rate, average prediction set size, and singleton rate \(the share of cases receiving exactly one class label\)\. Coverage drift was UCI coverage minus MIMIC coverage\.
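The procedure and its three metrics can be sketched in NumPy. This is a hand-written stand-in for the MAPIE classifier, not its API: it takes the LAC score on the calibration fold as one minus the probability assigned to the true class, and the function names and simulated data are hypothetical.

```python
import numpy as np


def lac_threshold(proba_cal, y_cal, alpha=0.10):
    """Conformal quantile of LAC scores (1 - p_true) on the calibration fold."""
    proba_cal = np.asarray(proba_cal, dtype=float)
    y_cal = np.asarray(y_cal, dtype=int)
    n = len(y_cal)
    scores = 1.0 - proba_cal[np.arange(n), y_cal]
    k = int(np.ceil((n + 1) * (1.0 - alpha)))   # finite-sample corrected rank
    if k > n:
        return np.inf                            # too few calibration points: include every class
    return np.sort(scores)[k - 1]


def prediction_sets(proba, q_hat):
    """Boolean (n_samples, n_classes) mask: class is kept when 1 - p <= q_hat."""
    return (1.0 - np.asarray(proba, dtype=float)) <= q_hat


def conformal_metrics(sets, y_true):
    """Empirical coverage, mean set size, and singleton rate."""
    y_true = np.asarray(y_true, dtype=int)
    sizes = sets.sum(axis=1)
    covered = sets[np.arange(len(y_true)), y_true]
    return {
        "coverage": float(covered.mean()),
        "mean_set_size": float(sizes.mean()),
        "singleton_rate": float((sizes == 1).mean()),
    }


# Simulated exchangeable data from a well-calibrated binary model.
rng = np.random.default_rng(42)

def simulate(n):
    p1 = rng.uniform(0.0, 1.0, n)
    y = (rng.uniform(0.0, 1.0, n) < p1).astype(int)
    return np.column_stack([1.0 - p1, p1]), y

proba_cal, y_cal = simulate(60)
proba_test, y_test = simulate(2000)
q_hat = lac_threshold(proba_cal, y_cal)
metrics = conformal_metrics(prediction_sets(proba_test, q_hat), y_test)
```

With 60 calibration points and α = 0.10 the threshold is the 55th smallest score. Coverage near 0.90 is expected only when calibration and test data are exchangeable, as they are in this simulation.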

### 2\.6 Deployment Readiness Framework

Eight criteria were defined before analysis began, drawing on reporting standards for early\-stage clinical AI evaluation\[[28](https://arxiv.org/html/2605.21566#bib.bib27),[7](https://arxiv.org/html/2605.21566#bib.bib23)\]\. Each was scored PASS \(threshold met, 2 points\), MARGINAL \(within 20% of threshold, 1 point\), or FAIL \(0 points\), for a maximum total of 16 points\.

1. Discrimination adequacy: AUROC ≥ 0\.85 on the external cohort\.
2. Calibration adequacy: ECE ≤ 0\.10 on the external cohort\.
3. Calibration stability: absolute calibration drift ≤ 0\.05\.
4. Uncertainty coverage: conformal coverage ≥ 0\.90 on the external cohort\.
5. Coverage stability: absolute coverage drift ≤ 0\.05\.
6. Prediction interpretability: singleton rate ≥ 0\.70 on the external cohort\.
7. Subgroup calibration equity: maximum ECE gap across subgroups ≤ 0\.05\.
8. Transparency: full code and pipeline publicly available \(automatic PASS\)\.
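The PASS/MARGINAL/FAIL rule for the numeric criteria can be sketched as a small function. One assumption is made loudly here: "within 20% of threshold" is read as a relative margin (for example, AUROC of at least 0.85 × 0.8 = 0.68 counts as MARGINAL), which the text does not spell out. The name `score_criterion` is hypothetical.

```python
def score_criterion(value, threshold, higher_is_better, margin=0.20):
    """Return 2 (PASS), 1 (MARGINAL), or 0 (FAIL) for one numeric criterion.

    ASSUMPTION: MARGINAL means missing the threshold by at most `margin`
    as a fraction of the threshold itself.
    """
    if higher_is_better:
        if value >= threshold:
            return 2
        if value >= threshold * (1.0 - margin):
            return 1
    else:
        if value <= threshold:
            return 2
        if value <= threshold * (1.0 + margin):
            return 1
    return 0


# XGB on the MIMIC demo cohort, using values reported in the Results.
xgb_scores = {
    "C1 discrimination": score_criterion(0.579, 0.85, higher_is_better=True),
    "C2 calibration": score_criterion(0.680, 0.10, higher_is_better=False),
    "C3 calibration stability": score_criterion(0.673, 0.05, higher_is_better=False),
    "C6 interpretability": score_criterion(0.474, 0.70, higher_is_better=True),
}
```

Under this reading all four XGB values shown score 0, since none comes within 20% of its threshold.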

Subgroup analysis stratified the MIMIC cohort by age \(below 65 vs\. 65 and above\), diabetes status, and hypertension status\. Groups with fewer than 10 patients were excluded from ECE computation because bin\-level estimates become unreliable at that scale\.

### 2\.7 Statistical Analysis

All metrics are point estimates on held\-out test sets\. Bootstrap confidence intervals \(95%, 1000 resamples, random\_state = 42\) were computed for AUROC and ECE on both cohorts\. No hypothesis tests were run; the study is descriptive and evaluative\.
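A bootstrap interval of this kind can be sketched as below. The percentile method is an assumption, since the text does not name the interval type, and `bootstrap_ci` plus the random demo scores are hypothetical; only the resample count, seed, and cohort composition (23 CKD, 74 controls) come from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_ci(y_true, p_pred, metric, n_resamples=1000, seed=42, level=0.95):
    """Percentile bootstrap confidence interval for a metric(y_true, p_pred)."""
    rng = np.random.default_rng(seed)
    y_true, p_pred = np.asarray(y_true), np.asarray(p_pred)
    stats = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(y_true), len(y_true))  # sample rows with replacement
        if np.unique(y_true[idx]).size < 2:
            continue                                      # AUROC is undefined with one class
        stats.append(metric(y_true[idx], p_pred[idx]))
    tail = 100.0 * (1.0 - level) / 2.0
    low, high = np.percentile(stats, [tail, 100.0 - tail])
    return float(low), float(high)


# Demo with the MIMIC cohort's class counts and uninformative random scores.
rng_demo = np.random.default_rng(0)
y_demo = np.array([1] * 23 + [0] * 74)
p_demo = rng_demo.uniform(0.0, 1.0, size=97)
ci_low, ci_high = bootstrap_ci(y_demo, p_demo, roc_auc_score)
```

With only 23 positive cases, intervals on this cohort are wide, which is worth keeping in mind when reading the external AUROC and ECE estimates.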

## 3 Results

### 3\.1 Cohort Characteristics

The UCI dataset included 400 patients, mean age 51\.6 years \(SD 17\.0\), CKD prevalence 62\.5% \(n = 250\)\. Fourteen continuous and 10 binary categorical features were present\. Preprocessing left zero missing values\. The 70/15/15 split produced 279 training, 60 validation, and 61 test patients, each fold within one percentage point of the overall CKD prevalence\.

The MIMIC\-IV demo stress\-test cohort had 97 patients, mean age 61\.7 years \(SD 16\.3\)\. CKD prevalence was 23\.7% \(n = 23\), substantially lower than the UCI training population \(62\.5%\), and reflecting a general hospital population rather than a nephrology referral center\. Seven of the 24 model features were entirely absent from the demo and were imputed using UCI training\-set statistics, meaning those columns carry no information specific to individual MIMIC patients\. Blood pressure was recovered for 94 of 97 patients through ICU chartevents and outpatient records\.

### 3\.2 Baseline Discrimination

All five models reached AUROC 1\.00 on the UCI test set \(Table[1](https://arxiv.org/html/2605.21566#S3.T1)\)\. The dataset offers essentially no discrimination challenge once models are fitted\. RF and SVM posted perfect F1, accuracy, sensitivity, and specificity\. LR lagged on accuracy \(0\.77\) and specificity \(0\.39\), a pattern that points to a miscalibrated intercept pushing most probabilities above the decision threshold\. XGB and NB achieved F1 of 0\.987\. Cross\-validation AUROC during tuning was 0\.999–1\.000 across the board\. The UCI benchmark appears saturated\.

Table 1: Baseline discrimination on the UCI test set \(n = 61\)\. All five models achieve AUROC 1\.00\. LR specificity of 0\.391 reflects a miscalibrated decision boundary rather than poor discrimination\.

### 3\.3 Pre\-Calibration Assessment

Before any post\-hoc correction, calibration varied widely\. LR was the weakest: ECE 0\.263, MCE 0\.483, Brier Score 0\.164, Brier Skill Score 0\.295\. XGB was the strongest, with ECE 0\.031, Brier Score 0\.005, and a Brier Skill Score of 0\.977\. RF, SVM, and NB fell in between with ECE values of 0\.053, 0\.042, and 0\.050\. Reliability diagrams confirmed LR’s systematic bias: underestimation at low predicted values, overestimation at the high end\. XGB and SVM stayed close to the diagonal throughout \(Figures[1](https://arxiv.org/html/2605.21566#S3.F1)and[2](https://arxiv.org/html/2605.21566#S3.F2)\)\.

![Refer to caption](https://arxiv.org/html/2605.21566v1/figures/reliability_compare_LR.png)\(a\)LR
![Refer to caption](https://arxiv.org/html/2605.21566v1/figures/reliability_compare_RF.png)\(b\)RF
![Refer to caption](https://arxiv.org/html/2605.21566v1/figures/reliability_compare_XGB.png)\(c\)XGB

Figure 1: Reliability diagrams for LR, RF, and XGB before and after post\-hoc calibration on the UCI test set\. Each panel shows the uncalibrated model \(grey\), Platt scaling \(pink\), and isotonic regression \(teal\)\. The diagonal is perfect calibration\. LR shows the largest pre\-calibration deviation\. Isotonic regression achieves the largest ECE reductions\. \(Continued in Figure [2](https://arxiv.org/html/2605.21566#S3.F2)\.\)

![Refer to caption](https://arxiv.org/html/2605.21566v1/figures/reliability_compare_SVM.png)\(a\)SVM
![Refer to caption](https://arxiv.org/html/2605.21566v1/figures/reliability_compare_NB.png)\(b\)NB

Figure 2: Reliability diagrams for SVM and NB before and after post\-hoc calibration on the UCI test set \(continued from Figure [1](https://arxiv.org/html/2605.21566#S3.F1)\)\. Platt scaling raised SVM calibration error rather than reducing it\. NB’s uncalibrated variant outperformed both post\-hoc variants\.

### 3\.4 Post\-Calibration Results

Isotonic regression cut ECE for four of the five models \(Table [2](https://arxiv.org/html/2605.21566#S3.T2)\)\. LR dropped from 0\.345 to 0\.022 \(95% CI: 0\.002, 0\.050\), a reduction of 0\.323\. RF reached ECE 0\.000\. XGB went from 0\.019 to 0\.007 \(95% CI: 0\.000, 0\.021\)\. SVM improved from 0\.031 to 0\.015 \(95% CI: 0\.000, 0\.037\)\. NB was the exception: its uncalibrated model \(ECE 0\.016, 95% CI: 0\.000, 0\.049\) outperformed both post\-hoc variants, so it was selected as NB’s best model\. Platt scaling produced inconsistent results: it helped LR but raised SVM’s ECE from 0\.031 to 0\.307, likely because the SVM’s decision scores on this dataset do not follow the sigmoid relationship to the true probabilities that Platt scaling assumes\.

### 3\.5 Distributional Stress\-Test on MIMIC\-IV Demo

Applying the best\-calibrated variant of each model to the MIMIC demo cohort produced a sharp deterioration across every metric \(Table[2](https://arxiv.org/html/2605.21566#S3.T2), Figure[3](https://arxiv.org/html/2605.21566#S3.F3)\)\. This outcome is expected given the two distributional differences: a 39\-percentage\-point prevalence gap and seven features imputed from UCI statistics rather than measured from individual patients\. AUROC fell to 0\.485 \(LR\), 0\.507 \(RF\), 0\.579 \(XGB\), 0\.483 \(SVM\), and 0\.477 \(NB\), all near or below chance\. ECE reached 0\.761 for LR \(95% CI: 0\.673, 0\.844\), 0\.753 for RF \(95% CI: 0\.660, 0\.835\), 0\.680 for XGB \(95% CI: 0\.594, 0\.777\), 0\.755 for SVM \(95% CI: 0\.667, 0\.837\), and 0\.753 for NB \(95% CI: 0\.660, 0\.835\)\. Calibration drift ranged from 0\.673 \(XGB\) to 0\.753 \(RF\), far above the 0\.05 threshold\. XGB was the least poor performer on both AUROC and ECE, a pattern consistent with gradient boosting learning slightly more transferable representations under regularization\.
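The reported drift range can be checked by subtracting the ECE values quoted in the text. The sketch assumes the selected variant was isotonic for LR, RF, XGB, and SVM (the text gives no lower Platt value for any of them) and the uncalibrated model for NB; the dictionary names are hypothetical.

```python
# Best-variant ECE on the UCI test set, as quoted in Section 3.4.
uci_ece = {"LR": 0.022, "RF": 0.000, "XGB": 0.007, "SVM": 0.015, "NB": 0.016}
# ECE of the same variants on the MIMIC-IV demo cohort, as quoted in Section 3.5.
mimic_ece = {"LR": 0.761, "RF": 0.753, "XGB": 0.680, "SVM": 0.755, "NB": 0.753}

# Calibration drift = MIMIC ECE minus UCI ECE for the selected variant.
drift = {model: round(mimic_ece[model] - uci_ece[model], 3) for model in uci_ece}
smallest = min(drift, key=drift.get)
largest = max(drift, key=drift.get)
```

Under those assumptions the drifts are 0.739 (LR), 0.753 (RF), 0.673 (XGB), 0.740 (SVM), and 0.737 (NB), which matches the stated range of 0.673 to 0.753.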

Table 2: Calibration summary for all five classifiers\. Best variant selected by lowest ECE on the UCI test set\. Bootstrap 95% CIs in brackets\. Calibration drift = MIMIC demo ECE − UCI ECE\. MIMIC results reflect distributional shift, not formal external validation\.

![Refer to caption](https://arxiv.org/html/2605.21566v1/figures/reliability_mimic_LR.png)\(a\)LR
![Refer to caption](https://arxiv.org/html/2605.21566v1/figures/reliability_mimic_RF.png)\(b\)RF
![Refer to caption](https://arxiv.org/html/2605.21566v1/figures/reliability_mimic_XGB.png)\(c\)XGB
![Refer to caption](https://arxiv.org/html/2605.21566v1/figures/reliability_mimic_SVM.png)\(d\)SVM
![Refer to caption](https://arxiv.org/html/2605.21566v1/figures/reliability_mimic_NB.png)\(e\)NB

Figure 3: Reliability diagrams for the best\-calibrated variant of each classifier on the MIMIC\-IV demo stress\-test cohort\. All curves fall below the diagonal, consistent with models trained on a 62\.5% CKD prevalence population being applied to a 23\.7% prevalence cohort with seven imputed features\. ECE ranged from 0\.680 \(XGB\) to 0\.761 \(LR\)\. Results demonstrate distributional failure, not clinical deployment performance\.

### 3\.6 Conformal Prediction Coverage

On the UCI test set, four of five models met the 90% coverage target \(Table[3](https://arxiv.org/html/2605.21566#S3.T3), Figures[4](https://arxiv.org/html/2605.21566#S3.F4)and[5](https://arxiv.org/html/2605.21566#S3.F5)\)\. NB achieved 0\.984 coverage with a singleton rate of 1\.00\. RF and XGB both reached 0\.967 coverage and singleton rates of 0\.967\. SVM achieved 0\.918\. LR fell short at 0\.803, consistent with its miscalibrated probabilities undermining the conformity scores\.

Coverage collapsed on MIMIC\. NB was highest at 0\.247; the others ranged from 0\.206 to 0\.237\. All five models fell below 0\.30 against a 0\.90 target\. Coverage drift ranged from 0\.566 \(LR\) to 0\.761 \(RF\)\. The conformal guarantee holds only when calibration and test data come from the same distribution\. These two cohorts are not exchangeable in that sense\.

Table 3: Conformal prediction results on the UCI test set and MIMIC\-IV demo stress\-test cohort \(target coverage 0\.90\)\. Coverage drift = UCI coverage − MIMIC coverage\. Coverage collapse on MIMIC reflects distributional non\-exchangeability, not a failure of the conformal method itself\.

![Refer to caption](https://arxiv.org/html/2605.21566v1/figures/conformal_setsize_uci.png)

Figure 4: Prediction set size distribution \(UCI test set\)\. Bar charts of set size \(0 = empty, 1 = singleton, 2 = ambiguous\) per model\. NB and RF produce the fewest ambiguous predictions\. Coverage target is 0\.90\. \(Individual\-level display in Figure [5](https://arxiv.org/html/2605.21566#S3.F5)\.\)

![Refer to caption](https://arxiv.org/html/2605.21566v1/figures/conformal_sample_uci.png)

Figure 5: Individual\-level uncertainty display for NB \(top 50 UCI test patients, sorted by descending predicted probability\)\. Bar colour reflects prediction set membership: red = CKD only, blue = not\-CKD only, yellow = ambiguous \(both classes\)\. Markers indicate true label\. Coverage target is 0\.90\. \(Set size distribution in Figure [4](https://arxiv.org/html/2605.21566#S3.F4)\.\)

### 3\.7 Deployment Readiness Scores

No model passed the checklist \(Table [4](https://arxiv.org/html/2605.21566#S3.T4), Figure [6](https://arxiv.org/html/2605.21566#S3.F6)\)\. Scores ran from 2 out of 16 \(XGB\) to 4 out of 16 \(LR, RF, SVM, NB\)\. The only criteria any model passed were prediction interpretability on MIMIC \(singleton rate of at least 0\.70\) and transparency\. Every model failed discrimination on MIMIC, calibration adequacy on MIMIC, calibration stability, conformal coverage on MIMIC, coverage stability, and subgroup equity\. Subgroup ECE gaps across age strata ranged from 0\.148 to 0\.209, far above the 0\.05 equity threshold\. XGB scored lowest \(2/16\) because its MIMIC singleton rate of 0\.474 failed the interpretability criterion as well; the other four models passed it, which accounts for their two additional points\.

Table 4: Deployment readiness scores for all five classifiers across eight criteria\. P = PASS \(2 pts\), M = MARGINAL \(1 pt\), F = FAIL \(0 pts\)\. Maximum score is 16\.

![Refer to caption](https://arxiv.org/html/2605.21566v1/figures/F4_deployment_heatmap.png)

Figure 6: Deployment readiness heatmap\. Rows are models; columns are the eight criteria\. Green = PASS, orange = MARGINAL, red = FAIL\. Total scores out of 16 are shown on the right\. All models score 2–4 out of 16\. The only passing criteria are interpretability \(C6\) and transparency \(C8\)\. Criteria C1–C5 and C7, which reference the stress\-test cohort, fail for every model, illustrating that internal performance does not predict behaviour under distribution shift\.

## 4 Discussion

### 4\.1 What the Distributional Stress\-Test Reveals

Every model in this study achieved AUROC 1\.00 on the internal UCI test set\. By the standard metrics that dominate published clinical ML literature, all five would be described as performing excellently\. When the same models were applied to the MIMIC\-IV demo cohort, AUROC fell to values indistinguishable from chance \(0\.48–0\.58\), ECE reached at least 0\.68 for every model, and conformal coverage dropped from near\-target to 0\.21–0\.25\.

Before interpreting these numbers, the stress\-test framing matters\. The MIMIC demo is an open\-access 100\-patient subset released for pipeline development and workshop use, not a formally designed validation cohort\. Its CKD prevalence \(23\.7%\) differs from the UCI training population \(62\.5%\) by 39 percentage points, and seven of 24 model features are entirely absent from the demo data\. The purpose of including it is to illustrate what happens when models encounter a population that does not resemble their training data, not to estimate clinical deployment performance\.

That said, the pattern is exactly what calibration theory predicts\. Isotonic recalibration that achieved ECE 0\.000 internally did not survive the shift\. Echouffo\-Tcheugui and Kengne found external calibration validation to be the exception rather than the norm in CKD prediction literature\[[8](https://arxiv.org/html/2605.21566#bib.bib11)\]\. Liou and colleagues documented a similar gap in a deployed malnutrition prediction model, finding calibration drift and subgroup bias that required post\-deployment recalibration within a large healthcare system\[[16](https://arxiv.org/html/2605.21566#bib.bib16)\]\. This study provides a concrete, reproducible illustration of why that gap matters: perfect internal metrics say nothing about behaviour under distribution shift\.

### 4\.2 What Drove the External Failure

Three factors contributed to the calibration collapse\. The first is prevalence shift\. A model trained and calibrated on a 62\.5% CKD population produces probabilities whose average sits near the training prevalence\. The MIMIC cohort sits at 23\.7% CKD, so those probability estimates are systematically too high: the observed CKD rate in each bin is lower than the predicted probability, producing calibration curves that fall below the diagonal at every predicted probability level\. This is the most direct driver of high MIMIC ECE\.
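The size of the prevalence effect can be illustrated with the textbook prior-shift adjustment, which rescales the odds by the ratio of target to training prevalence odds. This is a sketch under a strong assumption, namely that only the class prior differs between cohorts; the study did not apply this correction, it does nothing for the missing features, and `shift_prior` is a hypothetical name.

```python
def shift_prior(p, train_prev, target_prev):
    """Re-express a probability calibrated at train_prev for a population at target_prev.

    ASSUMPTION: pure label shift, so feature distributions within each class are
    unchanged. Requires 0 < p < 1.
    """
    odds = p / (1.0 - p)
    prior_ratio = (target_prev / (1.0 - target_prev)) / (train_prev / (1.0 - train_prev))
    new_odds = odds * prior_ratio
    return new_odds / (1.0 + new_odds)


# A 0.90 prediction calibrated at 62.5% prevalence, re-expressed at 23.7% prevalence.
adjusted_high = shift_prior(0.90, train_prev=0.625, target_prev=0.237)
# A prediction equal to the training prevalence maps to the target prevalence.
adjusted_base = shift_prior(0.625, train_prev=0.625, target_prev=0.237)
```

Under that assumption a predicted risk of 0.90 corresponds to roughly 0.63 in the lower-prevalence cohort, which shows how far uncorrected outputs can overstate risk from the prior alone.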

The second factor is feature missingness\. Seven of the 24 features in the UCI schema are not routinely recorded in MIMIC\. Urine\-specific gravity, urine sugar, pus cells, pus cell clumps, bacteria, appetite, and pedal edema were all imputed using UCI training\-set medians and modes\. Those imputed values carry no information about individual MIMIC patients\. From the model’s perspective, seven features are effectively noise columns on the external cohort\.

The third factor is dataset saturation\. AUROC 1\.00 on a 61\-patient test set, with near\-perfect cross\-validation scores, signals that the UCI benchmark does not offer a realistic discrimination challenge\. The learned decision boundaries are sharp enough to separate every patient in the training domain\. Those boundaries do not generalize to a population with a different disease severity distribution and different measurement patterns\.

XGBoost performed least poorly on the external cohort \(AUROC 0\.579, ECE 0\.680\)\. That small margin may reflect gradient boosting’s regularization limiting overfitting to UCI\-specific patterns\. Even so, the margin is too small to change the practical conclusion\.

### 4\.3 Conformal Prediction as a Diagnostic Tool

The conformal coverage collapse offers something calibration numbers alone do not: a direct, interpretable signal of distribution shift\. Conformal prediction has been applied in clinical settings precisely because it provides this kind of interpretable coverage guarantee, including recent applications to individual\-level diagnostic uncertainty in chronic disease\[[29](https://arxiv.org/html/2605.21566#bib.bib21),[24](https://arxiv.org/html/2605.21566#bib.bib22)\]\. The theoretical guarantee of split conformal prediction holds only when calibration and test data are exchangeable, as they are when both are drawn independently from the same distribution\[[1](https://arxiv.org/html/2605.21566#bib.bib19)\]\. Coverage of 0\.21–0\.25 on MIMIC, against a 0\.90 target, is a quantitative statement that the MIMIC population is not exchangeable with the UCI validation set\. A system operator who sees conformal coverage fall from 0\.97 to 0\.22 gets an immediate signal that the model is operating outside its valid domain\.

That is the practical value of including conformal prediction in a deployment framework, separate from its theoretical properties\. It creates a coverage audit trail\. High MIMIC singleton rates for LR \(1\.00\) and NB \(0\.99\) might appear to suggest interpretable outputs\. They do not: when coverage is only 0\.24, a singleton prediction is a confident wrong answer for roughly three quarters of patients\. Singleton rate without coverage context is not a useful interpretability metric\.
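The coverage audit described above can be sketched in a few lines. This is a toy illustration, not the study's implementation: it assumes the common nonconformity score of one minus the probability assigned to the true class, and the two synthetic samples (sharp scores that agree with the label in-domain, equally sharp scores that ignore the label after the shift) are constructed for the example rather than drawn from UCI or MIMIC. All function names are hypothetical.

```python
import math
import random


def conformal_threshold(cal_probs, cal_labels, alpha=0.10):
    """Quantile of nonconformity scores 1 - p(true class) on the calibration split."""
    scores = sorted(1.0 - (p if y == 1 else 1.0 - p) for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    return 1.0 if rank > n else scores[rank - 1]  # a threshold of 1.0 admits every label


def prediction_set(p, qhat):
    """Labels whose nonconformity score is at most qhat (p is the predicted P(CKD))."""
    return {label for label, prob in ((1, p), (0, 1.0 - p)) if 1.0 - prob <= qhat}


def audit(probs, labels, qhat):
    """Empirical coverage and singleton rate of the prediction sets."""
    sets = [prediction_set(p, qhat) for p in probs]
    coverage = sum(y in s for y, s in zip(labels, sets)) / len(sets)
    singleton_rate = sum(len(s) == 1 for s in sets) / len(sets)
    return coverage, singleton_rate


rng = random.Random(0)


def source_sample(n):
    """Synthetic in-domain data: sharp scores that agree with the label."""
    labels = [int(rng.random() < 0.625) for _ in range(n)]
    probs = [rng.uniform(0.90, 0.99) if y else rng.uniform(0.01, 0.10) for y in labels]
    return probs, labels


def shifted_sample(n):
    """Synthetic shifted data: equally sharp scores that ignore the label."""
    labels = [int(rng.random() < 0.237) for _ in range(n)]
    probs = [rng.uniform(0.90, 0.99) for _ in labels]
    return probs, labels


qhat = conformal_threshold(*source_sample(500), alpha=0.10)
in_cov, in_single = audit(*source_sample(2000), qhat)
out_cov, out_single = audit(*shifted_sample(2000), qhat)
print(f"in-domain: coverage={in_cov:.2f}, singleton rate={in_single:.2f}")
print(f"shifted:   coverage={out_cov:.2f}, singleton rate={out_single:.2f}")
```

In this toy setting the singleton rate stays high on the shifted sample while coverage falls far below the 0.90 target, which is the combination the text warns about: the sets look decisive but mostly exclude the true label.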

### 4\.4 Comparison to Published CKD Models

The eight externally validated CKD models identified by Echouffo\-Tcheugui and Kengne reported a range of calibration outcomes, but assessment was typically informal, often through visual calibration plots rather than ECE or Brier Score computation\[[8](https://arxiv.org/html/2605.21566#bib.bib11)\]\. More recent models, including the KFRE validated by Tangri and colleagues\[[25](https://arxiv.org/html/2605.21566#bib.bib3)\] and its UK validation by Major and colleagues\[[17](https://arxiv.org/html/2605.21566#bib.bib4)\], achieve strong external discrimination \(C\-statistic 0\.80–0\.90\) in prospective cohort studies with purpose\-built feature ascertainment\.

The AUROC values seen on MIMIC in this study \(0\.48–0\.58\) are not comparable to those results, because the KFRE and similar models were validated on cohorts where inputs were actually measured\. The comparison instead reinforces a narrower point: a model applied to harmonized data with seven imputed features is not the same as a model applied to complete data\. Deployment performance depends on what the deployment environment actually records, not on what the training environment provided\.

### 4\.5 Limitations

The most important limitation to state plainly: the MIMIC cohort used here is the publicly available 100\-patient demonstration subset released by PhysioNet for pipeline development, not a formally designed external validation cohort\. It was not selected to represent a specific clinical population, it contains only 23 CKD cases, and seven of the features were entirely missing and had to be imputed from UCI training statistics\. The stress\-test framing in this paper is intentional and accurate\. The MIMIC results show what distribution shift looks like numerically; they do not estimate performance in any real clinical deployment\.

At 97 patients after filtering, calibration metrics carry wide uncertainty regardless of the above\. Bootstrap confidence intervals for MIMIC ECE span as much as 0\.18 for some models\. Sample size requirements for precise external calibration assessment substantially exceed the 97 patients available in this stress\-test cohort\[[21](https://arxiv.org/html/2605.21566#bib.bib26)\]\. The point estimates should be read as directional, not precise\.
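To give a sense of how wide such intervals are at this sample size, the sketch below computes a percentile bootstrap interval for ECE on a synthetic 97-patient cohort. It is an illustration only: the paper does not state its bootstrap procedure or binning here, so the percentile method, ten equal-width bins, the positive-class form of ECE, and the synthetic scores are all assumptions. The function names are hypothetical.

```python
import random


def ece(probs, labels, n_bins=10):
    """Expected Calibration Error for P(positive) with equal-width bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if b:
            mean_pred = sum(p for p, _ in b) / len(b)
            obs_freq = sum(y for _, y in b) / len(b)
            err += len(b) / total * abs(mean_pred - obs_freq)
    return err


def bootstrap_interval(probs, labels, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for ECE, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(probs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(ece([probs[i] for i in idx], [labels[i] for i in idx]))
    stats.sort()
    lo = stats[int((1.0 - level) / 2.0 * n_boot)]
    hi = stats[min(n_boot - 1, int((1.0 + level) / 2.0 * n_boot))]
    return lo, hi


# Synthetic cohort: 97 patients, 23.7% prevalence, overconfident uninformative scores.
rng = random.Random(1)
n = 97
labels = [int(rng.random() < 0.237) for _ in range(n)]
probs = [rng.uniform(0.6, 1.0) for _ in range(n)]

point = ece(probs, labels)
lo, hi = bootstrap_interval(probs, labels)
print(f"ECE = {point:.3f}, 95% bootstrap interval = [{lo:.3f}, {hi:.3f}], width = {hi - lo:.3f}")
```

The interval width printed here describes only the synthetic data; the same two functions could be applied to real predictions and labels to report uncertainty alongside each point estimate.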

Feature harmonization made the missingness problem worse\. The seven imputed features were filled using UCI training\-set statistics, so those columns carry no information specific to individual MIMIC patients\. That is an inherent constraint of cross\-dataset harmonization when features are domain\-specific, and it cannot be corrected analytically after the fact\.

Subgroup analysis was additionally limited by sample size\. Diabetes and hypertension subgroups each fell below the 10\-patient exclusion threshold, and the age subgroups \(below 65: n = 55; 65 and above: n = 42\) were small enough that bin\-level ECE estimates carry meaningful uncertainty\. Any subgroup findings should be treated as exploratory\.

### 4\.6 Future Directions

Three areas appear worth pursuing\. First, the full MIMIC\-IV database would provide several thousand patients with CKD\-relevant laboratory data, allowing larger\-scale calibration assessment and properly powered subgroup analysis\. The extraction pipeline from this study is directly applicable to the full database\.

Second, the deployment readiness framework here is intentionally simple\. Each criterion is scored on a coarse three\-level scale, and thresholds are set a priori without formal sample size calculations\. Extending the framework to account for uncertainty in threshold exceedance, for example through bootstrap p\-values for each criterion, would make the checklist scores more principled\.

Third, patient\-facing uncertainty display is an open design problem\. Conformal prediction sets at the individual level \(CKD / not\-CKD / ambiguous\) produce clinically interpretable labels\. How clinicians and patients respond to ambiguous predictions in a real consultation, and whether explicit uncertainty communication changes treatment decisions compared to point probability outputs, has not been tested in CKD care\.

## 5 Conclusion

Five classifiers achieved perfect discrimination on the UCI CKD benchmark, then failed every stress\-test criterion when applied to the MIMIC\-IV demo cohort under deliberate distributional shift\. Isotonic recalibration reduced internal ECE to near zero\. On the demo cohort, calibration drift exceeded 0\.67 for every model, conformal coverage fell from 0\.967 or higher internally to 0\.21–0\.25, and no model scored above 4 out of 16 on the deployment readiness checklist\.

The MIMIC cohort used here is an open\-access 100\-patient demonstration set, not a formally designed external validation dataset\. The stress\-test framing is intentional\. Its purpose is to illustrate that internal calibration quality, however strong, does not guarantee the same reliability under prevalence shift and feature missingness\. That point holds regardless of the demo cohort’s limitations\.

The eight\-criterion framework introduced here offers one structured path for evaluating these requirements before clinical use\. Each criterion has a defined threshold and a three\-level scoring scheme applicable to any binary clinical prediction model\. Formal external validation on a purpose\-built cohort with properly measured features and an appropriate sample size is the necessary next step\. The framework provides the evaluation structure\. The full MIMIC\-IV database provides the cohort\.

## References

- \[1\] A\. N\. Angelopoulos and S\. Bates \(2021\)\. A gentle introduction to conformal prediction and distribution\-free uncertainty quantification\. arXiv preprint arXiv:2107\.07511\.
- \[2\] Q\. Bai, C\. Su, W\. Tang, and Y\. Li \(2022\)\. Machine learning to predict end stage kidney disease in chronic kidney disease\. Scientific Reports 12, pp\. 8377\. [DOI](https://dx.doi.org/10.1038/s41598-022-12316-z)\.
- \[3\] C\. R\. S\. Banerji, T\. Chakraborti, C\. Harbron, et al\. \(2023\)\. Clinical AI tools must convey predictive uncertainty for each individual patient\. Nature Medicine 29, pp\. 2996–2998\. [DOI](https://dx.doi.org/10.1038/s41591-023-02562-7)\.
- \[4\] J\. I\. Bravo\-Zuniga et al\. \(2025\)\. External validation, recalibration, and clinical utility of the kidney failure risk equation in patients with advanced CKD: a nationwide retrospective cohort analysis in Peru\. BMC Nephrology 26, pp\. 688\. [DOI](https://dx.doi.org/10.1186/s12882-025-04357-z)\.
- \[5\] A\. Campagner, E\. M\. Biganzoli, C\. Balsano, C\. Cereda, and F\. Cabitza \(2025\)\. Modeling unknowns: a vision for uncertainty\-aware machine learning in healthcare\. International Journal of Medical Informatics 203, pp\. 106014\. [DOI](https://dx.doi.org/10.1016/j.ijmedinf.2025.106014)\.
- \[6\] G\. S\. Collins, P\. Dhiman, J\. Ma, et al\. \(2024\)\. Evaluation of clinical prediction models \(part 1\): from development to external validation\. BMJ 384, pp\. e074819\. [DOI](https://dx.doi.org/10.1136/bmj-2023-074819)\.
- \[7\] G\. S\. Collins, K\. G\. M\. Moons, P\. Dhiman, et al\. \(2024\)\. TRIPOD\+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods\. BMJ 385, pp\. e078378\. [DOI](https://dx.doi.org/10.1136/bmj-2023-078378)\.
- \[8\] J\. B\. Echouffo\-Tcheugui and A\. P\. Kengne \(2012\)\. Risk models to predict chronic kidney disease and its progression: a systematic review\. PLOS Medicine 9\(11\), pp\. e1001344\. [DOI](https://dx.doi.org/10.1371/journal.pmed.1001344)\.
- \[9\] A\. Francis, M\. N\. Harhay, A\. C\. M\. Ong, et al\. \(2024\)\. Chronic kidney disease and the global public health agenda: an international consensus\. Nature Reviews Nephrology 20, pp\. 473–485\. [DOI](https://dx.doi.org/10.1038/s41581-024-00820-6)\.
- \[10\] A\. González\-Rocha, V\. A\. Colli, and E\. Denova\-Gutiérrez \(2023\)\. Risk prediction score for chronic kidney disease in healthy adults and adults with type 2 diabetes: systematic review\. Preventing Chronic Disease 20, pp\. 220380\. [DOI](https://dx.doi.org/10.5888/pcd20.220380)\.
- \[11\] C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\)\. On calibration of modern neural networks\. In Proceedings of the 34th International Conference on Machine Learning, pp\. 1321–1330\. arXiv:1706\.04599\.
- \[12\] A\. E\. W\. Johnson, L\. Bulgarelli, L\. Shen, et al\. \(2023\)\. MIMIC\-IV, a freely accessible electronic health record dataset\. Scientific Data 10, pp\. 1\. [DOI](https://dx.doi.org/10.1038/s41597-022-01899-x)\.
- \[13\] KDIGO CKD Work Group \(2024\)\. KDIGO 2024 clinical practice guideline for the evaluation and management of chronic kidney disease\. Kidney International 105\(4S\), pp\. S117–S314\. [DOI](https://dx.doi.org/10.1016/j.kint.2023.10.018)\.
- \[14\] S\. Krishnamurthy et al\. \(2021\)\. Machine learning prediction models for chronic kidney disease using national health insurance claim data in Taiwan\. Healthcare \(Basel\) 9\(5\), pp\. 546\. [DOI](https://dx.doi.org/10.3390/healthcare9050546)\.
- \[15\] J\. Li et al\. \(2025\)\. Machine learning models for predicting short\-term progression in patients with stage 4 chronic kidney disease: a multi\-center validation study\. Scientific Reports 15, pp\. 39285\. [DOI](https://dx.doi.org/10.1038/s41598-025-23037-4)\.
- \[16\] L\. Liou et al\. \(2024\)\. Assessing calibration and bias of a deployed machine learning malnutrition prediction model within a large healthcare system\. npj Digital Medicine 7, pp\. 149\. [DOI](https://dx.doi.org/10.1038/s41746-024-01141-5)\.
- \[17\] R\. W\. Major, D\. Shepherd, J\. F\. Medcalf, et al\. \(2019\)\. The Kidney Failure Risk Equation for prediction of end stage renal disease in UK primary care: an external validation and clinical impact projection cohort study\. PLOS Medicine 16\(11\), pp\. e1002955\. [DOI](https://dx.doi.org/10.1371/journal.pmed.1002955)\.
- \[18\] A\. Niculescu\-Mizil and R\. Caruana \(2005\)\. Predicting good probabilities with supervised learning\. In Proceedings of the 22nd International Conference on Machine Learning, pp\. 625–632\.
- \[19\] J\. Platt \(1999\)\. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods\. In Advances in Large Margin Classifiers, pp\. 61–74\.
- \[20\] R\. D\. Riley, L\. Archer, K\. I\. E\. Snell, et al\. \(2024\)\. Evaluation of clinical prediction models \(part 2\): how to undertake an external validation study\. BMJ 384, pp\. e074820\. [DOI](https://dx.doi.org/10.1136/bmj-2023-074820)\.
- \[21\] R\. D\. Riley, K\. I\. E\. Snell, L\. Archer, et al\. \(2024\)\. Evaluation of clinical prediction models \(part 3\): calculating the sample size required for an external validation study\. BMJ 384, pp\. e074821\. [DOI](https://dx.doi.org/10.1136/bmj-2023-074821)\.
- \[22\] L\. Rubini, P\. Soundarapandian, and P\. Eswaran \(2015\)\. Chronic Kidney Disease Dataset\. UCI Machine Learning Repository\. Accessed: 2025\-12\-01\. [https://doi\.org/10\.24432/C5G020](https://doi.org/10.24432/C5G020)\.
- \[23\] C\. Sabanayagam et al\. \(2025\)\. Artificial intelligence in chronic kidney disease management: a scoping review\. Theranostics 15\(10\), pp\. 4566–4578\. [DOI](https://dx.doi.org/10.7150/thno.108552)\.
- \[24\] A\. P\. Sreenivasan et al\. \(2025\)\. Conformal prediction enables disease course prediction and allows individualized diagnostic uncertainty in multiple sclerosis\. npj Digital Medicine 8, pp\. 224\. [DOI](https://dx.doi.org/10.1038/s41746-025-01616-z)\.
- \[25\] N\. Tangri, M\. E\. Grams, A\. S\. Levey, et al\. \(2016\)\. Multinational assessment of accuracy of equations for predicting risk of kidney failure: a meta\-analysis\. JAMA 315\(2\), pp\. 164–174\. [DOI](https://dx.doi.org/10.1001/jama.2015.18202)\.
- \[26\] V\. Taquet, V\. Blot, T\. Morzadec, L\. Lacombe, and N\. Brunel \(2022\)\. MAPIE: an open\-source library for distribution\-free uncertainty quantification\. arXiv preprint arXiv:2207\.12274\.
- \[27\] B\. Van Calster, D\. J\. McLernon, M\. van Smeden, L\. Wynants, and E\. W\. Steyerberg \(2019\)\. Calibration: the Achilles heel of predictive analytics\. BMC Medicine 17, pp\. 230\. [DOI](https://dx.doi.org/10.1186/s12916-019-1466-7)\.
- \[28\] B\. Vasey, M\. Nagendran, B\. Campbell, et al\. \(2022\)\. Reporting guideline for the early stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE\-AI\. BMJ 377, pp\. e070904\. [DOI](https://dx.doi.org/10.1136/bmj-2022-070904)\.
- \[29\] J\. Vazquez and J\. C\. Facelli \(2022\)\. Conformal prediction in clinical medical sciences\. Journal of Healthcare Informatics Research 6, pp\. 241–252\. [DOI](https://dx.doi.org/10.1007/s41666-021-00113-8)\.
- \[30\] B\. Zadrozny and C\. Elkan \(2002\)\. Transforming classifier scores into accurate multiclass probability estimates\. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp\. 694–699\.
