A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction
Summary
This paper introduces yvsoucom-iterkit, a deterministic, log-driven AutoML framework for reproducible pipeline optimization in healthcare risk prediction, evaluated on diabetes and stroke datasets with over 18,000 pipeline configurations, achieving strong performance and revealing structured search spaces with component redundancy.
Cached at: 05/22/26, 08:47 AM
# A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction
Source: [https://arxiv.org/html/2605.21528](https://arxiv.org/html/2605.21528)
Affiliations:
1. School of Basic Medicine, Hangzhou Normal University, No. 2318, Yuhangtang Rd, Yuhang District, Hangzhou 311121, Zhejiang, China
2. Hangzhou Domain Zones Technology Co. Ltd., Hangzhou, Zhejiang, China

Corresponding author ORCID: 0000-0002-9893-6135
###### Abstract
Accurate and reproducible disease risk prediction remains challenging due to heterogeneous features, limited samples, and severe class imbalance\. This study introduces yvsoucom\-iterkit, a deterministic and log\-driven automated machine learning framework that formulates pipeline optimization as a fully reproducible, configuration\-level system\. Each pipeline is encoded as a traceable log entity, enabling analysis of component attribution, interactions, similarity, and cross\-seed robustness\. Experiments on the Pima Indians Diabetes and Stroke datasets across more than 18,000 pipeline configurations reveal a structured and partially redundant search space, where performance is governed by a small subset of interacting components\. Random Forest importance analysis identifies augmentation \(0\.454\), model choice \(0\.198\), and imbalance handling \(0\.101\) as key drivers on Pima, while imbalance handling dominates Stroke \(0\.406\)\. Component similarity analysis shows strong redundancy, with feature selection variants \(biMax–biMean\) exhibiting low RMS distance \(0\.0252\), mixup closely matching no augmentation \(0\.0279\), and TomekLinks aligning with no imbalance handling \(0\.0325\), whereas Gaussian noise shows greater divergence from no augmentation \(0\.10\)\. The framework achieves strong and stable performance using ensemble models \(Weighted\-F1 0\.89, Macro\-F1 0\.88 on Pima; Weighted\-F1 0\.94 on Stroke\), while Macro\-F1 remains lower on Stroke \(0\.67\) due to class imbalance\. Cross\-seed analysis reveals a performance–robustness trade\-off, with ensembles showing lower variability \(0\.023–0\.026\) than SVM\. These results indicate that effective AutoML optimization can focus on a reduced set of high\-impact components\.
###### keywords:
Automated machine learning; Log-driven systems; Reproducible machine learning; Pipeline optimization; Class imbalance; Healthcare risk prediction; Model interpretability
###### Highlights
- A deterministic, log-driven AutoML framework enables fully reproducible and traceable pipeline optimization
- Large-scale evaluation (18,000+ pipelines) reveals a structured and partially redundant AutoML search space
- Performance is governed by a small subset of interacting components
- Ensemble models achieve strong and stable performance (Macro-F1 $\approx 0.88$ on Pima; Weighted-F1 $\approx 0.94$ on Stroke), while Stroke shows reduced Macro-F1 ($\approx 0.67$) under severe class imbalance
- Cross-seed analysis reveals a performance–robustness trade-off, with ensembles more stable ($\sigma \approx 0.023$–$0.026$) than SVM
## 1 Introduction
Chronic diseases such as diabetes mellitus and cerebrovascular disorders represent major global health challenges, affecting hundreds of millions of individuals worldwide\. Early identification of high\-risk individuals is critical for enabling timely intervention and reducing severe complications, including cardiovascular and neurological conditions\. With the increasing availability of structured clinical data, machine learning \(ML\) techniques have become essential tools for healthcare risk prediction, enabling the discovery of complex patterns from heterogeneous clinical attributes\[HaibeKains2020\]\.
Despite significant progress, developing reliable ML\-based healthcare prediction models remains challenging\. Clinical datasets are often small, imbalanced, and heterogeneous, and model performance is highly sensitive to preprocessing decisions, including feature selection, normalization, and data augmentation\. Many existing studies rely on manually designed or weakly optimized pipelines, limiting reproducibility and generalizability across datasets and clinical settings\.
Automated machine learning \(AutoML\) has emerged as a promising solution by enabling data\-driven exploration of model architectures and hyperparameters\. However, most existing AutoML frameworks, such as Auto\-sklearn, TPOT, and AutoGluon, primarily focus on model selection and hyperparameter optimization, while treating preprocessing and data transformation steps as fixed or only loosely optimized components\. As a result, complex interactions between preprocessing strategies and learning algorithms remain underexplored, particularly in small and imbalanced healthcare datasets where such interactions are critical\.
To address these limitations, we propose yvsoucom-iterkit, a pipeline-centric AutoML framework that jointly optimizes preprocessing and modeling within a unified search space. The framework treats feature selection, normalization, data augmentation, class-imbalance handling, and classification models as first-class components, enabling systematic exploration of their combinations. It operates as a fully automated and reproducible experimentation engine, capable of generating, executing, and evaluating thousands of pipeline configurations without human intervention.
Compared with our previous work\[huang2026\_ssrn\], this study significantly extends the scope and rigor of the proposed system by incorporating additional datasets, expanding the pipeline search space, and introducing comprehensive cross\-seed evaluation to assess robustness\. Furthermore, the framework integrates automated logging and statistical analysis modules, enabling transparent and reproducible benchmarking across large\-scale experiments\.
To evaluate the effectiveness and generality of the proposed system, we conduct extensive experiments on two representative healthcare prediction tasks: diabetes risk prediction using the Pima Indians Diabetes dataset and stroke risk prediction using the Healthcare Stroke Dataset\. These datasets differ substantially in feature composition, class distribution, and clinical context, providing a rigorous testbed for assessing robustness and transferability\. Experimental results demonstrate that the proposed framework consistently identifies high\-performing and stable pipeline configurations, highlighting the importance of joint optimization across preprocessing and modeling stages\.
Although this study focuses on two representative healthcare tasks, the proposed framework is designed as a general\-purpose AutoML system for structured data\. It supports flexible combinations of preprocessing and modeling strategies and can be readily applied to other domains involving categorical prediction tasks\.
The main contributions and findings of this work are summarized as follows:
- We propose a pipeline-centric AutoML framework that jointly optimizes preprocessing, data augmentation, imbalance handling, and classification models within a unified and fully reproducible configuration space for healthcare data analysis.
- We introduce a log-driven (LogDir) execution paradigm that enables transparent and traceable large-scale experimentation by linking each pipeline configuration to its performance outcomes.
- We conduct a large-scale evaluation over 18,000+ pipeline configurations on the Pima and Stroke datasets, enabling multi-level analysis from raw metric distributions to pipeline-level behavior, component-level attribution, cross-component interactions, and cross-seed robustness.
- We find that the AutoML search space is highly structured and partially redundant, where many configurations yield near-equivalent performance and only a small subset of components drives performance variation.
- We find that ensemble models consistently achieve the best balance between accuracy and robustness, while high-capacity models such as SVM exhibit higher sensitivity to stochastic variation across seeds.
- We identify strong dataset-dependent behavior, where Pima exhibits a relatively stable performance landscape, while Stroke shows pronounced sensitivity due to severe class imbalance.
The remainder of this paper is organized as follows. Section [2](https://arxiv.org/html/2605.21528#S2) reviews related work on healthcare prediction and AutoML. Section [3](https://arxiv.org/html/2605.21528#S3) presents the proposed methodology and framework. Section [4](https://arxiv.org/html/2605.21528#S4) reports experimental results, followed by discussion and conclusions in Sections [5](https://arxiv.org/html/2605.21528#S5) and [6](https://arxiv.org/html/2605.21528#S6).
## 2 Related Work
### 2.1 Automated Machine Learning
Automated Machine Learning (AutoML) has evolved into a central paradigm for reducing manual effort in model development by automating algorithm selection, hyperparameter tuning, and, to a limited extent, pipeline construction [hutter2019automated,HE2021106622]. Early systems such as Auto-WEKA [Thornton2013] formulated the Combined Algorithm Selection and Hyperparameter (CASH) problem and addressed it using Bayesian optimization. Subsequent approaches, including TPOT [Olson2016], introduced evolutionary strategies to explore pipeline configurations, while Auto-sklearn [Feurer2015,Feurer2022] incorporated meta-learning and ensemble construction to improve efficiency and robustness. More recent frameworks such as AutoGluon [Erickson2020AutoGluon] and H2O AutoML [LeDell2020H2O] further emphasize scalability and ensemble learning in large-scale applications.
Despite these advances, existing AutoML systems largely retain a *model-centric optimization paradigm*. Preprocessing operations, such as feature selection, normalization, and data augmentation, are typically treated as auxiliary or weakly parameterized components rather than first-class elements of the search space [Zoller2021Benchmarking]. Even when pipeline structures are explored (e.g., TPOT), the search process is often stochastic and opaque, limiting interpretability and reproducibility. Furthermore, most frameworks prioritize predictive performance as the primary objective, with limited consideration of robustness, variability, and statistical reliability across runs [Bischl2023Hyperparameter].
These limitations are particularly restrictive in healthcare settings, where model behavior is highly sensitive to preprocessing choices and data characteristics\. As a result, existing AutoML approaches provide only partial automation and insufficient transparency for systematic pipeline\-level analysis\.
### 2.2 Preprocessing and Imbalanced Learning in Healthcare Data
Preprocessing is a critical determinant of model performance in clinical machine learning due to the inherent characteristics of healthcare datasets, including small sample sizes, heterogeneous feature types, and severe class imbalance\. Feature selection methods, particularly those based on information theory such as information gain and mutual information, have been widely adopted to improve generalization and reduce overfitting\[Guyon2006,Chandrashekar2014,li2017feature\]\. Wrapper and embedded methods further extend these approaches but often introduce additional computational complexity and instability\[Saeys2007Feature\]\.
Class imbalance further complicates predictive modeling\. Techniques such as SMOTE\[Chawla2002SMOTE\], Tomek Links\[Tomek1976\], and hybrid resampling methods \(e\.g\., SMOTEENN\[Batista2004SMOTEENN\]\) aim to rebalance class distributions\. More recent approaches, including ADASYN\[He2008ADASYN\]and MixUp\[zhang2018mixup\], introduce adaptive or interpolation\-based augmentation to improve generalization\. Cost\-sensitive learning and focal loss mechanisms have also been proposed to address imbalance at the algorithmic level\[Lin2017FocalLoss\]\.
Although these methods have demonstrated effectiveness in specific scenarios, they are typically evaluated in isolation or within manually constructed pipelines. Consequently, their combined effects, interactions, and trade-offs remain insufficiently understood. Moreover, prior studies rarely investigate how preprocessing decisions influence not only predictive accuracy but also *stability under stochastic variation*, which is critical for reliable deployment in clinical environments.
### 2.3 Machine Learning for Healthcare Risk Prediction
Machine learning techniques have been extensively applied to healthcare risk prediction tasks, including diabetes diagnosis and stroke prediction\. Ensemble methods such as Random Forest\[Breiman2001\]and XGBoost\[chen2016xgboost\]are frequently reported as strong performers on structured clinical data due to their ability to model nonlinear relationships and handle feature interactions\[sarwar2020diagnosis\]\. Gradient boosting frameworks have also demonstrated superior performance in tabular domains compared to deep learning approaches\[Grinsztajn2022Tabular\]\.
Numerous studies on the Pima Indians Diabetes dataset and stroke prediction benchmarks report improved performance through tailored combinations of preprocessing and model tuning\[pati2023review,patil2013performance,Singh2025StrokeML\]\. Deep learning approaches, including multilayer perceptrons and hybrid architectures, have also been explored, but often require extensive tuning and larger datasets to achieve competitive performance\[Shickel2018DeepEHR\]\.
However, these approaches are typically dataset\-specific and rely on manual or heuristic design choices\. As a result, their conclusions are often difficult to generalize, and reproducibility is limited due to incomplete reporting of experimental configurations\[Pineau2021ImprovingReproducibility\]\. More importantly, existing studies predominantly focus on optimizing single pipelines or a small set of configurations, rather than systematically exploring the broader pipeline space\. This restricts the ability to identify generalizable patterns, quantify variability, or understand the relative importance of different pipeline components\.
### 2.4 Research Gap and Contribution
The above analysis reveals a fundamental gap in current research: the absence of a unified, reproducible, and systematically analyzable framework for *pipeline-centric* AutoML in structured healthcare data.
Existing AutoML systems emphasize model\-level optimization but lack transparent and deterministic exploration of preprocessing–model interactions\. Conversely, studies on preprocessing and healthcare prediction provide valuable insights into individual techniques but fail to integrate them into a scalable and reproducible experimental framework\. In addition, robustness and stochastic sensitivity are rarely incorporated into evaluation protocols, despite their importance in real\-world clinical deployment\.
To address these limitations, we propose yvsoucom-iterkit, a pipeline-centric AutoML framework that explicitly models preprocessing operations, augmentation strategies, imbalance handling, and classification algorithms within a unified and fully enumerable search space. Unlike prior approaches, the proposed system adopts a *log-driven and deterministic execution paradigm*, enabling:
- systematic and exhaustive exploration of thousands of pipeline configurations,
- full reproducibility through structured logging of all experimental details,
- component-level and interaction-level statistical analysis,
- and robustness evaluation through multi-seed experimentation.
This design shifts AutoML from a black\-box optimization problem to a transparent and analyzable experimental process, enabling deeper understanding of how pipeline components jointly influence performance, robustness, and generalization\. In particular, by integrating multi\-seed evaluation and statistical aggregation, the framework provides an implicit bridge between empirical AutoML practice and theoretical considerations such as bias–variance trade\-offs and stochastic sensitivity, which are largely absent from existing pipeline optimization frameworks\.
## 3 Methodology and Framework
This section presents the proposed pipeline\-centric AutoML framework, integrating dataset design, pipeline modeling, system architecture, and automated experimentation into a unified methodology\. The framework systematically explores a combinatorial space of pipeline configurations through deterministic enumeration, enabling large\-scale, reproducible, and interaction\-aware benchmarking for structured healthcare data\.
### 3.1 Datasets
Experiments are conducted on two publicly available datasets representing heterogeneous clinical settings and feature characteristics:
Pima Indians Diabetes Dataset. This dataset contains 768 instances with eight clinical features (e.g., glucose, blood pressure, BMI, age) and a binary diabetes outcome. All features are numerical, and the dataset exhibits moderate class imbalance.
Healthcare Stroke Dataset. This dataset includes both numerical and categorical variables, such as age, average glucose level, BMI, hypertension status, heart disease, and smoking status, with a binary stroke label. Compared to the diabetes dataset, it presents higher class imbalance and increased feature heterogeneity due to mixed data types.
These datasets serve as complementary case studies, enabling evaluation across purely numerical and mixed\-type feature spaces, thereby assessing the robustness and transferability of the proposed AutoML framework under varying data characteristics\.
### 3.2 Pipeline Design and Search Space
The proposed framework defines a unified pipeline configuration space:
$$\mathcal{P} = F \times K \times N \times O \times A \times B \times M \times S \times T \times R \tag{1}$$
where each dimension corresponds to a configurable component:
- $F$: feature selection strategy
- $K$: number of selected features
- $N$: normalization method
- $O$: normalization order (before or after augmentation and imbalance handling)
- $A$: data augmentation method
- $B$: imbalance handling method
- $M$: classification model
- $S$: train/test split ratio
- $T$: decision threshold
- $R$: random seed
Each configuration $p \in \mathcal{P}$ defines a complete machine learning pipeline, where preprocessing, modeling, and experimental factors are jointly specified. The framework captures interactions among these components within a unified combinatorial search space.
Unlike conventional AutoML approaches that rely on stochastic or heuristic search strategies, the proposed framework adopts deterministic enumeration of $\mathcal{P}$. This enables systematic and exhaustive (or controlled) exploration of the configuration space, ensuring full transparency, reproducibility, and comparability across pipeline variants.
By explicitly integrating preprocessing operations, execution order, model selection, and experimental parameters into a unified formulation, the proposed search space substantially extends conventional AutoML designs\. This unified representation provides a structured and interpretable foundation for interaction\-aware analysis of pipeline behavior, enabling systematic investigation of how individual components jointly influence performance\.
Consequently, the proposed formulation shifts the focus of AutoML from isolated model\-centric optimization toward holistic, pipeline\-centric system exploration, where performance is understood as an emergent property of coordinated component interactions rather than single\-stage decisions\.
### 3.3 Framework Architecture and Automated Execution
Figure 1: System architecture of the proposed AutoML framework. The framework follows a centralized configuration and deterministic pipeline enumeration strategy, enabling parallel execution and reproducible large-scale experimentation.

As illustrated in Fig. [1](https://arxiv.org/html/2605.21528#S3.F1), the proposed AutoML framework is a modular and extensible system for pipeline-centric optimization on structured healthcare datasets. It adopts a deterministic, pre-enumerated execution model that enables systematic exploration of preprocessing, modeling, and experimental configurations within a unified search space.
The framework consists of five core components: \(i\) data ingestion, \(ii\) preprocessing and pipeline optimization, \(iii\) model management, \(iv\) experiment execution, and \(v\) result aggregation and analysis\. These components interact through standardized interfaces and are coordinated by a centralized configuration module, ensuring consistency across all experimental stages\.
Each experimental configuration is represented as an independent pipeline instance generated prior to execution\. This flat execution structure eliminates recursive dependencies and ensures that all configurations are explicitly defined, directly comparable, and fully reproducible\.
Deterministic enumeration enables complete or controlled coverage of the pipeline search space, in contrast to stochastic AutoML approaches based on heuristic or probabilistic search\. Combined with parallel execution and structured logging, this design supports large\-scale experimentation with strong guarantees of traceability, consistency, and reproducibility\.
##### Centralized Configuration and Branch Enumeration\.
A centralized configuration defines the full experimental design, including datasets, preprocessing options, pipeline components, model candidates, hyperparameter ranges, evaluation metrics, and statistical protocols. Based on this configuration, the framework deterministically enumerates all valid pipeline configurations (referred to as *branches*), each representing a unique combination of preprocessing strategies, model choices, and experimental parameters.
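As a minimal sketch of how deterministic branch enumeration could work, the snippet below expands a small configuration dictionary into a fixed, ordered list of branches with `itertools.product`. The dictionary `SEARCH_SPACE`, the option names, the `branch-000000` identifier format, and the function `enumerate_branches` are all hypothetical and are not taken from yvsoucom-iterkit; no validity filtering is applied here.

```python
from itertools import product

# Hypothetical, heavily reduced search space; option names are illustrative
# and are not read from the framework's actual configuration.
SEARCH_SPACE = {
    "feature_selection": ["ig", "biMax", "biMean"],
    "k_features": [4, 6, 8],
    "normalization": ["standard", "minmax"],
    "norm_order": ["before", "after"],
    "augmentation": ["none", "gaussian_noise", "mixup"],
    "imbalance": ["none", "smote", "tomek"],
    "model": ["lr", "rf", "xgb"],
    "seed": [0, 1, 2],
}


def enumerate_branches(space):
    """Yield (branch_id, config) pairs in a fixed, reproducible order."""
    names = sorted(space)  # sorted keys: order never depends on dict layout
    for idx, values in enumerate(product(*(space[n] for n in names))):
        yield f"branch-{idx:06d}", dict(zip(names, values))


branches = list(enumerate_branches(SEARCH_SPACE))
print(len(branches))  # 3*3*2*2*3*3*3*3 = 2916
print(branches[0])
```

Because every branch is generated before anything runs, the same configuration always maps to the same identifier, which is what makes a flat, order-independent execution possible.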
##### Model Integration and Registry\.
Models are integrated through a unified interface that standardizes training, prediction, and evaluation\. A centralized registry maintains metadata for each model, including supported input formats, preprocessing compatibility, and configurable hyperparameters\.
##### Parallel Execution and Caching\.
Each pipeline configuration is executed independently and can be dispatched in parallel\. Execution order does not affect correctness due to pre\-defined branch generation\. To improve efficiency, the framework supports caching of intermediate results \(e\.g\., preprocessed data\), indexed using deterministic configuration signatures for consistent reuse\.
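One way to index cached intermediate results by a deterministic configuration signature is to hash a canonical JSON form of the relevant configuration keys. The sketch below is an assumption about how such a signature might be built; `config_signature`, `preprocess_cached`, `PREPROCESS_KEYS`, and the choice of SHA-256 are illustrative and not the framework's documented mechanism.

```python
import hashlib
import json


def config_signature(config, keys):
    """Hash only the given keys of a configuration, in canonical JSON form."""
    subset = {k: config[k] for k in sorted(keys)}
    payload = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Hypothetical: only these keys influence the preprocessed data.
PREPROCESS_KEYS = ("feature_selection", "k_features", "normalization", "seed")
_cache = {}


def preprocess_cached(config, compute):
    """Return the cached result, computing it at most once per signature."""
    sig = config_signature(config, PREPROCESS_KEYS)
    if sig not in _cache:
        _cache[sig] = compute(config)
    return _cache[sig]


calls = []


def fake_preprocess(config):
    """Stand-in for an expensive preprocessing step; records each call."""
    calls.append(config["model"])
    return f"data[{config['feature_selection']},k={config['k_features']}]"


base = {"feature_selection": "ig", "k_features": 6,
        "normalization": "minmax", "seed": 0}
a = preprocess_cached({**base, "model": "rf"}, fake_preprocess)
b = preprocess_cached({**base, "model": "svm"}, fake_preprocess)
print(a == b, len(calls))  # True 1
```

Two branches that differ only in the model share one signature here, so the preprocessing step runs once and is reused.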
##### Logging, Output, and Reproducibility\.
All experiments are executed within the yvsoucom-iterkit system, which automates pipeline generation, execution, and evaluation under a unified workflow. Controlled random seeds ensure deterministic behavior across runs. A structured logging mechanism records complete experimental metadata, including pipeline configurations, preprocessing steps, model parameters, intermediate outputs, raw predictions, and evaluation metrics. Each execution is associated with a unique branch identifier, enabling full traceability and systematic aggregation of results.
The log\-centric design preserves complete execution histories, supporting reproducible analysis and independent verification of pipeline behavior\. By maintaining access to both intermediate and final outputs, the framework enables fine\-grained inspection of model performance beyond aggregate metrics\.
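The log-centric idea can be illustrated with one JSON line per branch, linking the branch identifier to its configuration and metrics. This is only a sketch under assumed field names (`branch_id`, `config`, `metrics`); the actual log schema of yvsoucom-iterkit is not given in the text, and the metric values below are made up for illustration.

```python
import io
import json


def write_log_entry(stream, branch_id, config, metrics):
    """Append one branch result as a single JSON line."""
    record = {"branch_id": branch_id, "config": config, "metrics": metrics}
    stream.write(json.dumps(record, sort_keys=True) + "\n")


def read_log(stream):
    """Parse a JSON Lines log back into a list of records."""
    return [json.loads(line) for line in stream if line.strip()]


# An in-memory buffer stands in for a log file on disk.
log = io.StringIO()
write_log_entry(log, "branch-000000",
                {"model": "rf", "imbalance": "smote", "seed": 0},
                {"weighted_f1": 0.90, "macro_f1": 0.70})  # toy values
write_log_entry(log, "branch-000001",
                {"model": "svm", "imbalance": "none", "seed": 0},
                {"weighted_f1": 0.85, "macro_f1": 0.60})  # toy values

log.seek(0)
records = read_log(log)
rf_runs = [r["branch_id"] for r in records if r["config"]["model"] == "rf"]
print(rf_runs)  # ['branch-000000']
```

Keeping each record self-describing means later analyses can filter and aggregate by any component without re-running the pipelines.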
### 3.4 Preprocessing and Pipeline Components
The proposed framework treats preprocessing operations as first\-class components within the optimization space, enabling their systematic integration and joint evaluation with model selection\. The following preprocessing strategies are considered:
Missing Value Handling. Implausible values (e.g., zero entries in physiological measurements) are treated as missing and imputed using mean substitution:
$$x_i = \begin{cases} \mu, & \text{if } x_i \text{ is missing or invalid} \\ x_i, & \text{otherwise} \end{cases} \tag{2}$$
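A minimal sketch of the mean-substitution rule in Eq. (2) follows. It assumes that $\mu$ is the mean of the valid entries only and that zero or `None` marks an invalid value; neither detail is stated in the text, and the function name `impute_invalid_with_mean` is hypothetical.

```python
from statistics import fmean


def impute_invalid_with_mean(values, is_invalid=lambda v: v is None or v == 0):
    """Replace invalid entries with the mean of the valid ones (Eq. 2)."""
    valid = [v for v in values if not is_invalid(v)]
    if not valid:
        raise ValueError("no valid values to compute a mean from")
    mu = fmean(valid)  # assumption: mean excludes the invalid entries
    return [mu if is_invalid(v) else v for v in values]


glucose = [148, 85, 0, 89, None, 137]  # toy column; 0 and None are invalid
print(impute_invalid_with_mean(glucose))
# [148, 85, 114.75, 89, 114.75, 137]
```

In a train/test setting, $\mu$ would normally be estimated on the training split and reused on the test split to avoid leakage.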
Feature Selection. Features are ranked using Information Gain (IG):
$$IG(Y, A) = H(Y) - H(Y \mid A) \tag{3}$$
The top-$k$ features are selected based on ranking. To capture different relevance criteria, three variants are considered: (i) standard IG, (ii) maximum binary IG, and (iii) mean binary IG.
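The standard IG ranking in Eq. (3) can be sketched for discrete features as below. Continuous features would first need discretization, and the two binary-IG variants are not defined precisely in this section, so neither is implemented here; the helper names `entropy`, `information_gain`, and `top_k` are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(Y) in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """IG(Y, A) = H(Y) - H(Y | A) for a discrete feature A."""
    n = len(labels)
    groups = {}
    for a, y in zip(feature, labels):
        groups.setdefault(a, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k(features, labels, k):
    """Return the names of the k features with the highest IG."""
    scores = {name: information_gain(col, labels)
              for name, col in features.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]


labels = [1, 1, 0, 0]
features = {
    "glucose_high": [1, 1, 0, 0],  # perfectly separates the classes
    "noise": [0, 1, 0, 1],         # independent of the label
}
print(information_gain(features["glucose_high"], labels))  # 1.0
print(information_gain(features["noise"], labels))         # 0.0
print(top_k(features, labels, 1))                          # ['glucose_high']
```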
Normalization. Feature scaling is performed using two standard strategies:
$$x' = \frac{x - \mu}{\sigma} \quad (\text{StandardScaler}) \tag{4}$$
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad (\text{MinMaxScaler}) \tag{5}$$
These methods are evaluated in conjunction with different execution orders within the pipeline\.
Data Augmentation. To mitigate limited sample sizes and improve generalization, augmentation is applied exclusively to the training data. The framework supports multiple strategies, including CTGAN-based synthetic generation, Gaussian noise injection, and MixUp interpolation, enabling systematic comparison of generative and perturbation-based approaches.
Imbalance Handling. To address class imbalance, the framework incorporates a diverse set of resampling techniques, including SMOTE, ADASYN, SMOTEENN, Tomek Links, and random over/under-sampling. Rather than evaluating these methods in isolation, the framework integrates them within the full pipeline, allowing analysis of their interactions with feature selection, normalization, and model choice.
### 3.5 Learning Models
The framework evaluates a diverse set of classifiers:
- Logistic Regression (LR): linear baseline model.
- Support Vector Machine (SVM): margin-based classifier with kernel support.
- Decision Tree (DT): interpretable tree-based model.
- Random Forest (RF): ensemble of decision trees with bootstrap aggregation.
- Gradient Boosting (GB): sequential ensemble minimizing loss gradients.
- XGBoost (XGB): regularized and efficient gradient boosting implementation.
Additionally, two neural network architectures are considered: a deep model with multiple hidden layers and regularization, and a shallow model for lightweight learning\. Both are trained using Adam optimization, early stopping, and binary cross\-entropy loss\.
### 3.6 Evaluation Metrics
Performance is evaluated using accuracy, precision, recall, and F1\-score to capture both overall predictive performance and class\-specific behavior\. For multi\-class aggregation, three standard strategies are employed:
- Macro averaging: unweighted mean across classes
- Weighted averaging: weighted by class support
- Micro averaging: global computation over all samples
For binary classification, the metrics are defined as:
$$\text{Precision} = \frac{TP}{TP + FP}, \tag{6}$$
$$\text{Recall} = \frac{TP}{TP + FN}, \tag{7}$$
$$F1 = \frac{2PR}{P + R}, \tag{8}$$
where $P$ and $R$ denote precision and recall.
This multi\-metric evaluation provides a comprehensive assessment of model performance, particularly under class imbalance\.
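To make the difference between the three averaging strategies concrete, the sketch below computes per-class F1 from Eqs. (6)–(8) and then the macro, weighted, and micro averages on a toy imbalanced label set. The data are invented for illustration and the function names are hypothetical.

```python
def f1_for_class(y_true, y_pred, cls):
    """Per-class F1 built from TP, FP, FN as in Eqs. (6)-(8)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def averaged_f1(y_true, y_pred):
    """Return macro, weighted, and micro F1 for single-label predictions."""
    classes = sorted(set(y_true))
    f1 = {c: f1_for_class(y_true, y_pred, c) for c in classes}
    support = {c: y_true.count(c) for c in classes}
    macro = sum(f1.values()) / len(classes)
    weighted = sum(f1[c] * support[c] for c in classes) / len(y_true)
    # Pooled counts: every error is one FP (predicted class) and one FN (true class).
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    errors = len(y_true) - tp
    micro = 2 * tp / (2 * tp + errors + errors)
    return {"macro": macro, "weighted": weighted, "micro": micro}


# Toy imbalanced data: 18 negatives, 2 positives.
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 17 + [1] + [1, 0]  # one false positive, one missed positive
scores = averaged_f1(y_true, y_pred)
print({k: round(v, 3) for k, v in scores.items()})
# {'macro': 0.722, 'weighted': 0.9, 'micro': 0.9}
```

The minority class has F1 of 0.5, which pulls the macro average well below the weighted average; the same mechanism explains why Macro-F1 can stay low on a heavily imbalanced dataset even when Weighted-F1 is high.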
### 3.7 Multi-Level Analysis Framework
To systematically analyze the pipeline search space, we adopt a multi\-level analytical framework that captures performance behavior from global distribution to fine\-grained component interactions\.
(1) Branch-Level Distribution Analysis. Let $\mathcal{P}$ denote the set of all pipeline configurations. We analyze the distribution of performance metrics $\{M(p) \mid p \in \mathcal{P}\}$ to characterize global properties of the search space, including clustering, dispersion, and structural patterns. This analysis captures overall stability and sensitivity across configurations.
(2) Pipeline-Level Performance Analysis. We identify top- and bottom-performing configurations based on evaluation metrics such as weighted precision, recall, F1-score, and integrated performance score. Formally, we rank configurations $p \in \mathcal{P}$ according to $M(p)$ and analyze the corresponding component combinations. This enables identification of effective pipeline designs.
(3) Component-Level Statistical Analysis. For each component $C_i$, we aggregate performance over all configurations sharing the same value $v \in \mathcal{V}_i$. This yields statistics $\mu_{i,v}$ and $\sigma_{i,v}$, representing the mean performance and variability associated with each component value. This analysis isolates the general effect of individual components independent of specific pipeline compositions.
(4) Cross-Component Interaction Analysis. To capture dependencies between components, we analyze joint distributions of performance conditioned on multiple components, i.e., $M(p) \mid (C_i = v_i, C_j = v_j)$. This reveals interaction effects that cannot be observed through marginal analysis, demonstrating that pipeline performance is governed by combinations of components rather than isolated factors.
(5) Robustness and Variability Analysis. To evaluate reliability, we analyze performance across multiple random seeds. For each configuration, we compute mean and variance over repeated runs. Additionally, non-parametric statistical tests (e.g., Friedman and Wilcoxon tests) are applied to assess the significance of differences between configurations. This ensures that observed performance patterns reflect consistent behavior rather than stochastic variation.
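The cross-seed aggregation in step (5) can be sketched as a group-by over configurations that differ only in the seed. The record layout, the `cross_seed_summary` name, the use of the sample standard deviation, and all scores below are assumptions made for illustration; they are not results from the paper.

```python
from collections import defaultdict
from statistics import fmean, stdev


def cross_seed_summary(records):
    """Group scores by configuration and summarise them across seeds."""
    by_config = defaultdict(list)
    for r in records:
        by_config[r["config"]].append(r["score"])
    return {
        cfg: {
            "mean": fmean(scores),
            # sample standard deviation; undefined for a single run
            "std": stdev(scores) if len(scores) > 1 else 0.0,
            "n_seeds": len(scores),
        }
        for cfg, scores in by_config.items()
    }


# Made-up scores for two hypothetical configurations over three seeds.
runs = [
    {"config": "rf+smote", "seed": 0, "score": 0.88},
    {"config": "rf+smote", "seed": 1, "score": 0.89},
    {"config": "rf+smote", "seed": 2, "score": 0.87},
    {"config": "svm+smote", "seed": 0, "score": 0.91},
    {"config": "svm+smote", "seed": 1, "score": 0.84},
    {"config": "svm+smote", "seed": 2, "score": 0.86},
]
summary = cross_seed_summary(runs)
for cfg, stats in summary.items():
    print(cfg, round(stats["mean"], 3), round(stats["std"], 3))
```

For the significance tests named above, SciPy provides `scipy.stats.wilcoxon` and `scipy.stats.friedmanchisquare`, which could be applied to the per-seed score vectors of paired configurations.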
### 3.8 Component Importance and Search Space Analysis
This subsection formalizes the analysis of pipeline component importance and search space structure\. The goal is to identify which components significantly influence performance, how stable their effects are, and how configuration choices shape the optimization landscape\.
##### Pipeline Search Space\.
Let $\mathcal{P}$ denote the pipeline search space, defined as the Cartesian product of $k$ components:
$$\mathcal{P} = \mathcal{V}_1 \times \mathcal{V}_2 \times \cdots \times \mathcal{V}_k, \tag{9}$$
where each component $C_i$ takes values from a finite set $\mathcal{V}_i$. A pipeline configuration is denoted as $p = (v_1, v_2, \dots, v_k) \in \mathcal{P}$. Let $M(p)$ denote the performance metric (e.g., Accuracy, F1-score) associated with configuration $p$.
##### Component\-Level Aggregation\.
To evaluate the global effect of a componentCiC\_\{i\}, performance is aggregated over all configurations sharing the same valuev∈𝒱iv\\in\\mathcal\{V\}\_\{i\}:
$$\mu_{i,v} = \mathbb{E}[M(p) \mid C_i = v], \quad \sigma_{i,v} = \mathrm{Std}(M(p) \mid C_i = v). \tag{10}$$
Component\-level statistics:
$$\Delta_{i}=\max_{v\in\mathcal{V}_{i}}\mu_{i,v}-\min_{v\in\mathcal{V}_{i}}\mu_{i,v},\tag{11}$$
$$\sigma_{i}=\mathbb{E}_{v\in\mathcal{V}_{i}}[\sigma_{i,v}],\tag{12}$$
$$S_{i}=\mathrm{Var}_{v\in\mathcal{V}_{i}}(\mu_{i,v}),\tag{13}$$
where:
- $\Delta_{i}$ measures the performance impact of component $C_{i}$,
- $\sigma_{i}$ measures stability across configurations,
- $S_{i}$ captures sensitivity to value changes, defined as
$$S_{i}=\frac{1}{|\mathcal{V}_{i}|}\sum_{v\in\mathcal{V}_{i}}\left(\mu_{i,v}-\bar{\mu}_{i}\right)^{2},\quad\bar{\mu}_{i}=\frac{1}{|\mathcal{V}_{i}|}\sum_{v\in\mathcal{V}_{i}}\mu_{i,v}.\tag{14}$$
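The statistics in Eqs. (10)–(14) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the framework's implementation: the record layout (one dict per configuration), the function name `component_stats`, and the toy scores are assumptions, and the population standard deviation is used for $\sigma_{i,v}$ because the text does not say which estimator is intended.

```python
from collections import defaultdict
from statistics import mean, pstdev, pvariance


def component_stats(results, component, metric="macro_f1"):
    """Return (Delta_i, sigma_i, S_i) for one pipeline component.

    `results` is a list of dicts, one per configuration (hypothetical layout).
    """
    # Group M(p) by the value v that component C_i takes in each configuration.
    by_value = defaultdict(list)
    for row in results:
        by_value[row[component]].append(row[metric])

    mu = {v: mean(scores) for v, scores in by_value.items()}       # mu_{i,v}, Eq. (10)
    sigma = {v: pstdev(scores) for v, scores in by_value.items()}  # sigma_{i,v}, Eq. (10)

    delta = max(mu.values()) - min(mu.values())   # Eq. (11)
    sigma_i = mean(list(sigma.values()))          # Eq. (12)
    s_i = pvariance(list(mu.values()))            # Eqs. (13)-(14), population variance
    return delta, sigma_i, s_i


# Toy scores for illustration only; they are not taken from the paper's logs.
toy_results = [
    {"aug": "none", "model": "svm", "macro_f1": 0.70},
    {"aug": "none", "model": "dt", "macro_f1": 0.60},
    {"aug": "noise", "model": "svm", "macro_f1": 0.60},
    {"aug": "noise", "model": "dt", "macro_f1": 0.50},
]
delta, sigma_i, s_i = component_stats(toy_results, "aug")
```

On the toy data the two augmentation values have means 0.65 and 0.55, so $\Delta_{i}$ is 0.10 while $S_{i}$ is 0.0025; the two quantities rank components similarly when a component has only two values and diverge as the value set grows.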
##### Random Forest–Based Attribution\.
A Random Forest model predicts performance metrics from pipeline configurations\. The resulting feature importance provides a model\-based estimate of component contribution:
$$R_{i}=\text{RFImportance}(C_{i}).\tag{15}$$
##### Integrated Definition of Component Importance\.
A potential direction for future work is an integrated component importance measure that combines performance impact, sensitivity, model-based attribution, and stability into a single score, providing a more comprehensive assessment of component influence across pipeline configurations. Conceptually, such a measure can be written as:
$$\mathcal{I}(C_{i})=f\big(\Delta_{i},\ \sigma_{i},\ S_{i},\ R_{i}\big),\tag{16}$$
where the combining function $f$ is left unspecified and importance is high if any of the following holds:
- High performance impact ($\Delta_{i}$),
- High sensitivity across values ($S_{i}$),
- High Random Forest attribution ($R_{i}$),
- Acceptable stability (moderate $\sigma_{i}$).
### 3\.9Value\-Level Component Similarity Analysis
To quantify functional similarity between different values of a pipeline component, we perform a value\-level RMS\-difference analysis across all pipeline branches\. This approach identifies configurations that yield nearly identical performance, enabling search space pruning and more efficient AutoML exploration\.
##### Setup\.
Let $C_{i}$ denote a pipeline component with value set $\mathcal{V}_{i}=\{v_{1},v_{2},\dots,v_{|\mathcal{V}_{i}|}\}$. Let $\mathcal{B}$ denote the set of all pipeline branches (configurations). For each branch $b\in\mathcal{B}$ and each value $v\in\mathcal{V}_{i}$, let $M_{m}^{b}(v)$ denote the performance on metric $m$ when $C_{i}=v$ and other component values are fixed as in branch $b$.
##### Per\-Branch RMS Computation\.
For each branch $b$ and each pair of values $(v_{a},v_{b})$ of component $C_{i}$, compute the root-mean-square difference across metrics:
$$\text{RMS}^{b}(v_{a},v_{b})=\sqrt{\frac{1}{|\text{Metrics}|}\sum_{m\in\text{Metrics}}\big(M_{m}^{b}(v_{a})-M_{m}^{b}(v_{b})\big)^{2}}.\tag{17}$$
If metric normalization is desired, divide each squared difference by the corresponding metric variance $\mathrm{Var}_{b}(M_{m}^{b})$ before taking the square root.
##### Aggregate Across Branches\.
To summarize the value\-pair difference across all branches, compute the mean RMS:
$$\text{RMS}_{\text{total}}(v_{a},v_{b})=\frac{1}{|\mathcal{B}|}\sum_{b\in\mathcal{B}}\text{RMS}^{b}(v_{a},v_{b}).\tag{18}$$
This provides a single, interpretable measure of functional difference between $v_{a}$ and $v_{b}$ across the entire search space.
##### Interpretation\.
- Small $\text{RMS}_{\text{total}}(v_{a},v_{b})$ indicates functional equivalence between $v_{a}$ and $v_{b}$.
- Large $\text{RMS}_{\text{total}}(v_{a},v_{b})$ indicates meaningful performance differences.
##### Operationalization\.
1. For each component $C_{i}$, iterate over all value pairs $(v_{a},v_{b})$.
2. Compute per-branch RMS differences $\text{RMS}^{b}(v_{a},v_{b})$ while keeping other component values fixed.
3. Aggregate over branches to compute $\text{RMS}_{\text{total}}(v_{a},v_{b})$.
4. Identify clusters of functionally equivalent values for search space reduction.
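Steps 1–3 can be sketched as follows. This is an illustrative reading of Eqs. (17)–(18) under stated assumptions: each record holds a `config` dict and a `scores` dict (a hypothetical layout), a branch is identified by the values of all components other than $C_{i}$, and the mean is taken over the branches in which both values of a pair were evaluated. The optional variance normalization is omitted.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def rms_total(results, component, metrics):
    """Mean per-branch RMS difference for every value pair of one component."""
    # Group records by branch context: the setting of all *other* components.
    contexts = defaultdict(dict)
    for row in results:
        context = tuple(sorted(
            (k, v) for k, v in row["config"].items() if k != component
        ))
        contexts[context][row["config"][component]] = row["scores"]

    per_pair = defaultdict(list)
    for scores_by_value in contexts.values():
        for va, vb in combinations(sorted(scores_by_value), 2):
            squared = [
                (scores_by_value[va][m] - scores_by_value[vb][m]) ** 2
                for m in metrics
            ]
            per_pair[(va, vb)].append(sqrt(sum(squared) / len(metrics)))  # Eq. (17)

    # Eq. (18): average the per-branch RMS values for each pair.
    return {pair: sum(vals) / len(vals) for pair, vals in per_pair.items()}


# Toy records for illustration only; the scores are invented.
toy = [
    {"config": {"aug": "mixup", "model": "svm"}, "scores": {"acc": 0.80, "f1": 0.70}},
    {"config": {"aug": "none", "model": "svm"}, "scores": {"acc": 0.80, "f1": 0.70}},
    {"config": {"aug": "mixup", "model": "xgb"}, "scores": {"acc": 0.90, "f1": 0.80}},
    {"config": {"aug": "none", "model": "xgb"}, "scores": {"acc": 0.86, "f1": 0.77}},
]
pair_rms = rms_total(toy, "aug", ["acc", "f1"])
```

In the toy data the two augmentation values are identical under the SVM branch and differ by 0.04 and 0.03 under the XGBoost branch, so the aggregate sits halfway between zero and the XGBoost-branch RMS.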
##### Benefits\.
This RMS\-based analysis provides a fine\-grained, quantitative measure of redundancy and similarity in the pipeline search space, complementing component\-level and branch\-level analyses while offering an interpretable scale of performance difference\.
### 3\.10Cross\-Seed Robustness Analysis
Each pipeline configuration is evaluated across multiple random seeds. For each metric $m$, compute the mean and standard deviation over the $R$ runs:
$$\bar{m}=\frac{1}{R}\sum_{r=1}^{R}m_{r},\quad\sigma_{m}=\sqrt{\frac{1}{R}\sum_{r=1}^{R}(m_{r}-\bar{m})^{2}}.\tag{19}$$
Statistical significance is assessed using the Friedman test followed by Wilcoxon signed\-rank tests\.
The Normalized Robust Rank Score (NRRS) combines performance and stability:
$$\text{NRRS}=\bar{r}+\lambda\cdot\sigma_{r},\tag{20}$$
where $\bar{r}$ is the mean rank, $\sigma_{r}$ is the rank variance across runs, and $\lambda$ weights the stability term. Lower NRRS indicates robust, high-performing pipelines.
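A small sketch of Eqs. (19)–(20) follows. Several choices here are assumptions rather than details given in the text: pipelines are ranked within each seed with rank 1 for the highest score, ties share the average rank, $\sigma_{r}$ is computed as the population standard deviation of the ranks by analogy with Eq. (19) (substitute `pvariance` if a variance is intended), and $\lambda=1$. The function names and scores are hypothetical.

```python
from statistics import mean, pstdev


def seed_summary(values):
    """Mean and population standard deviation over seeds, as in Eq. (19)."""
    return mean(values), pstdev(values)


def rank_within_seed(scores):
    """Rank pipelines for one seed; rank 1 is best, ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for name in ordered[i:j + 1]:
            ranks[name] = average_rank
        i = j + 1
    return ranks


def nrrs(scores_by_seed, lam=1.0):
    """Eq. (20): mean rank plus lam times the spread of the rank across seeds."""
    per_seed = [rank_within_seed(s) for s in scores_by_seed.values()]
    result = {}
    for name in per_seed[0]:
        ranks = [r[name] for r in per_seed]
        result[name] = mean(ranks) + lam * pstdev(ranks)
    return result


# Toy Macro-F1 scores for three hypothetical pipelines under three seeds.
scores_by_seed = {
    7: {"A": 0.88, "B": 0.85, "C": 0.80},
    23: {"A": 0.86, "B": 0.87, "C": 0.79},
    126: {"A": 0.89, "B": 0.84, "C": 0.81},
}
nrrs_scores = nrrs(scores_by_seed)
```

Pipeline C is perfectly stable (always rank 3) yet still scores worst, which illustrates that with a moderate $\lambda$ the mean rank dominates and the spread term mainly separates pipelines of similar average rank.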
## 4Experiments and Results
This section presents the experimental setup and results used to evaluate the proposed framework on the Pima and Stroke datasets, focusing on predictive performance and robustness across diverse pipeline configurations\. All experiments were conducted on a MacBook Air with Apple M4 chip, 16 GB RAM, running macOS 15\.6\.1\.
### 4\.1Datasets and Experimental Configuration
Experiments were conducted on two representative healthcare datasets to evaluate the effectiveness of the proposed framework under heterogeneous conditions\.
The Pima Indians Diabetes dataset consists exclusively of numerical features and represents a small-scale, moderately imbalanced binary classification task. In contrast, the Healthcare Stroke dataset includes both numerical and categorical features, requiring automatic encoding and presenting a more complex and highly imbalanced prediction scenario.
A unified automated configuration system was used to generate all experiments\. The framework systematically explores combinations of the following components:
- Normalization order: Norm_First $\in\{\text{True},\text{False}\}$
- Train–test splits: $\{0.1,0.2\}$
- Probability thresholds: $\{0.35,0.5\}$
- Models: SVM, Decision Tree, Logistic Regression, Random Forest, Gradient Boosting, XGBoost
- Augmentation: None, Gaussian Noise, MixUp
- Imbalance handling: None, SMOTE, ADASYN, Random Under-Sampling, Tomek Links
- Feature selection: infgain, biMeanInfgain, biMaxInfgain, or none
- Feature count: upper-range feature counts per dataset (e.g., 4–8 when 8 features are available, 5–9 when 9 features are available).
Four augmentation–imbalance modes were evaluated: \(1\) none, \(2\) augmentation only, \(3\) imbalance handling only, and \(4\) combined strategies\.
Performance was evaluated using Accuracy, Macro/Micro/Weighted Precision, Recall, F1\-score, and a composite Integrated Score\.
To assess robustness, experiments on the Pima dataset were repeated across multiple random seeds\. These seeds affect data splitting, augmentation sampling, and model initialization, enabling evaluation of performance stability and sensitivity to stochastic variation\. This multi\-seed design provides a more reliable estimate of generalization performance, particularly for small and imbalanced datasets\.
### 4\.2Branch\-Based Execution and Experimental Scale
Each unique configuration is defined as a*branch*, representing a fully specified machine\-learning pipeline\. Branches are generated through deterministic enumeration and executed independently, enabling structured comparison across pipeline configurations\.
In total, 18,720 branches were executed and grouped into 3,120 data\-branch collections, forming a large\-scale yet controlled experimental environment\. This scale allows systematic exploration of interactions between preprocessing, augmentation, imbalance handling, and model selection\.
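Deterministic enumeration of branches can be illustrated with a Cartesian product over fixed, ordered value lists. The sketch below is an assumption about how such enumeration could look, not the framework's code: the dictionary keys and value labels are hypothetical, and feature counts, scaling options, and any constraints applied by the framework are omitted, so the number of branches it produces is not expected to match the 18,720 reported above.

```python
from itertools import product

# Value lists transcribed from the component list in Section 4.1;
# key names and string labels are illustrative only.
SEARCH_SPACE = {
    "norm_first": [True, False],
    "test_split": [0.1, 0.2],
    "threshold": [0.35, 0.5],
    "model": ["SVM", "DecisionTree", "LogisticRegression",
              "RandomForest", "GradientBoosting", "XGBoost"],
    "augmentation": ["none", "gaussian_noise", "mixup"],
    "imbalance": ["none", "SMOTE", "ADASYN", "RandomUnderSampling", "TomekLinks"],
    "feature_selection": ["infgain", "biMeanInfgain", "biMaxInfgain", "none"],
}


def enumerate_branches(space):
    """Yield one fully specified configuration per branch, in a fixed order."""
    keys = list(space)  # dicts keep insertion order, so the order is reproducible
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


branches = list(enumerate_branches(SEARCH_SPACE))
```

Because the order depends only on the value lists, re-running the enumeration yields the same branch at the same index, which is what makes a branch identifier stable across runs.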
### 4\.3Log\-Driven Execution and Data Aggregation
A key feature of the framework is its log-driven architecture. Each pipeline execution produces structured logs capturing configuration parameters, execution traces, prediction outputs, and evaluation metrics. These logs are organized in a hierarchical directory structure, where each experiment is associated with a dedicated LogDir, enabling full traceability from configuration to results.
For the Pima dataset, six independent runs were conducted, while the Stroke dataset includes five runs. Each run generates a corresponding LogDir, enabling systematic multi-run analysis.
The staticsanalysis() module serves as the central aggregation mechanism, transforming raw logs into unified analytical datasets. Its behavior is controlled by the runid parameter:
- A specific runid enables within-run analysis.
- Multiple runids enable cross-run aggregation.
- No runid aggregates all runs for global analysis.
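The three runid modes amount to a filter applied before aggregation. The snippet below is a hypothetical stand-in that mirrors only the behavior described in the list above; it is not the signature or implementation of staticsanalysis(), and the record layout and the name `select_runs` are invented for illustration.

```python
def select_runs(records, runid=None):
    """Select log records by run (hypothetical helper).

    runid=None     -> all runs (global analysis)
    runid="run1"   -> a single run (within-run analysis)
    runid=[...]    -> several runs (cross-run aggregation)
    """
    if runid is None:
        return list(records)
    wanted = {runid} if isinstance(runid, str) else set(runid)
    return [r for r in records if r["runid"] in wanted]


# Toy records; identifiers and scores are invented.
records = [
    {"runid": "run1", "macro_f1": 0.70},
    {"runid": "run2", "macro_f1": 0.68},
    {"runid": "run3", "macro_f1": 0.71},
]
all_runs = select_runs(records)
single_run = select_runs(records, "run2")
cross_run = select_runs(records, ["run1", "run3"])
```

Keeping selection separate from metric computation is what allows the same stored logs to serve both within-run and cross-run analysis without re-executing any pipeline.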
This design decouples execution from analysis, allowing reproducible post hoc evaluation without re\-running pipelines\.
##### Evaluation Protocol\.
The framework adopts a two\-level evaluation protocol:*within\-run*and*cross\-run*analysis\.
Within-run analysis evaluates all pipeline configurations within a single LogDir, where executions are deterministic given a fixed seed. This enables controlled comparison of preprocessing strategies and model configurations.
Cross-run analysis aggregates results across independent runs with different random seeds, capturing variability from stochastic factors such as data splitting and model initialization. Performance is summarized using mean, standard deviation, and ranking-based measures (e.g., NRRS).
Statistical significance is assessed using non\-parametric tests, including the Friedman test and Wilcoxon signed\-rank test, ensuring robust comparison across pipeline configurations\.
### 4\.4Statistical Analysis
All statistical analyses in this study are derived from the aggregated logs and merged CSV datasets generated by the yvsoucom-iterkit framework. The availability of both raw per-branch metrics and aggregated summaries enables comprehensive evaluation from multiple analytical perspectives.
All experimental data, tables, figures, and generated artifacts are publicly available via the GitHub repository ([https://github.com/yvsoucom/itekit-examples](https://github.com/yvsoucom/itekit-examples)). These resources support full reproducibility and allow direct inspection of both high-performing and low-performing pipeline instances.
The statistical evaluation is organized into five complementary analytical components:
(1) Branch-level distribution and trend analysis. This analysis examines the full distribution of performance across all pipeline branches, revealing clustering behavior, dispersion patterns, and structural properties of the search space. It provides insight into stability and sensitivity across configurations.
(2) Pipeline-level performance analysis. Top- and bottom-performing configurations are identified based on key evaluation metrics, including weighted precision, recall, F1-score, and the integrated performance score. This analysis highlights effective combinations of preprocessing strategies and models, providing direct guidance for optimal pipeline design.
(3) Component-level statistical analysis. This analysis evaluates the average contribution of individual pipeline components (feature selection methods, normalization strategies, data augmentation techniques, imbalance-handling approaches, and classifiers) across all configurations. By aggregating results across branches, it isolates the general effect of each component independent of specific pipeline compositions.
(4) Cross-component interaction analysis. This component investigates how different pipeline elements jointly influence performance. Using merged datasets, interaction patterns between models, preprocessing strategies, and imbalance-handling methods are analyzed to reveal dependencies that are not observable through isolated component evaluation. This demonstrates that predictive performance is governed by interactions among components rather than individual factors alone.
(5) Robustness and variability analysis. This analysis evaluates the stability of pipeline performance across repeated runs and different random seeds. Statistical measures such as mean and variance are used to assess consistency, while non-parametric statistical tests further validate observed differences between pipeline configurations. This ensures that reported results reflect reliable and reproducible behavior rather than stochastic variation.
Representative figures and summary tables are included in the main text, while complete experimental results, detailed statistical outputs, and full merged datasets are provided in the Appendix to ensure transparency and enable further investigation\.
#### 4\.4\.1Raw Metric Distributions and Branch\-Level Trends
This section provides a fine\-grained analysis of pipeline behavior by examining raw metric distributions and performance trends across all evaluated branches\. Unlike aggregated summaries, this analysis exposes the full variability and structural characteristics of the explored configuration space\.
Each branch corresponds to a fully specified pipeline instance\. By analyzing metric distributions across all branches, it is possible to observe clustering behavior, dispersion patterns, and the presence of outlier configurations\. These properties provide insight into both the stability of the framework and the sensitivity of performance to configuration choices\.
##### Raw Metric Analysis\.
For both the Pima and Stroke datasets, the highest F1 scores are consistently observed on the majority class \(Class 0\), which is expected due to class imbalance\.
Table [Supplementary A.1](https://arxiv.org/html/2605.21528#Ax1.T1) reports class-wise F1 scores for the Pima dataset using the abbreviated notation defined in the Pima Supplementary Materials. Detailed results for the Stroke dataset are also provided in the Stroke Supplementary Materials.
For the Pima dataset (Table [Supplementary A.1](https://arxiv.org/html/2605.21528#Ax1.T1)), Class 0 F1 ranges from 0.905 to 0.909, Weighted-F1 from 0.865 to 0.885, and Macro-F1 from 0.848 to 0.876, indicating relatively balanced predictive performance across classes.
In contrast, the Stroke dataset shows very high Class 0 F1 \(0\.977–0\.978\) and Weighted\-F1 \(0\.936–0\.943\), but substantially lower Macro\-F1 \(0\.562–0\.622\), indicating degraded performance on minority classes and stronger class imbalance effects\.
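The gap between Weighted-F1 and Macro-F1 under imbalance follows directly from how the two averages weight each class. The worked example below uses invented counts (950 majority and 50 minority samples), not the Stroke data; the function names are illustrative.

```python
def f1(tp, fp, fn):
    """Per-class F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_weighted_f1(per_class):
    """per_class maps label -> (tp, fp, fn); class support is tp + fn."""
    scores = {c: f1(*counts) for c, counts in per_class.items()}
    support = {c: tp + fn for c, (tp, fp, fn) in per_class.items()}
    total = sum(support.values())
    macro = sum(scores.values()) / len(scores)                      # equal class weights
    weighted = sum(scores[c] * support[c] / total for c in scores)  # support weights
    return macro, weighted


# Invented confusion counts: the classifier finds only 10 of 50 minority cases.
toy_counts = {
    0: (940, 40, 10),  # majority class
    1: (10, 10, 40),   # minority class
}
macro_f1, weighted_f1 = macro_weighted_f1(toy_counts)
```

Here the minority-class F1 is about 0.29 and the majority-class F1 about 0.97, giving a Weighted-F1 near 0.94 but a Macro-F1 near 0.63: the weighted average is dominated by the 95% majority class, while the macro average exposes the weak minority-class performance.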
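Table 1: Top 5 Class-wise F1 Scores on Pima Dataset

| F1C | RunID | Feat | FS | Sc | Aug | Imb | Model | Acc | W-F1 | M-F1 | Norm | Split | Prob | Seed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.909 | 00754 | 4 | bMax | Std | MX | nIMB | SVM | 0.883 | 0.884 | 0.873 | L | 0.1 | 0.35 | 126 |
| 0.909 | 00754 | 4 | bMean | Std | MX | nIMB | SVM | 0.883 | 0.884 | 0.873 | F | 0.1 | 0.35 | 126 |
| 0.906 | 01919 | 5 | bMean | Std | MX | nIMB | RF | 0.870 | 0.866 | 0.849 | L | 0.1 | 0.50 | 7 |
| 0.906 | 00754 | 4 | bMean | Std | MX | nIMB | GB | 0.870 | 0.866 | 0.849 | L | 0.1 | 0.50 | 126 |
| 0.905 | 01411 | 6 | bMax | Std | NA | Tomek | XGB | 0.883 | 0.885 | 0.876 | L | 0.1 | 0.35 | 126 |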
##### Trend and Geometric Analysis\.
Branch\-level distributions reveal structured, quasi\-cyclic variation in both datasets when projected along the branch axis, reflecting systematic differences across pipeline configurations\.
For the Pima dataset, performance exhibits relatively regular fluctuations with alternating levels of dispersion\. Branches form sequences of moderate and higher deviation, indicating phased variation across configuration groups\. This suggests a stable and low\-sensitivity landscape, where changes in certain components introduce limited and predictable variation\.
In contrast, the Stroke dataset shows a bounded concentration–dispersion pattern\. Metric values remain within relatively stable upper and lower bounds, while intermediate configurations exhibit higher dispersion before reconverging toward these boundaries\. This behavior reflects stronger sensitivity to pipeline design, particularly under class imbalance, with alternating phases of divergence and stabilization\.
Overall, Pima demonstrates a more uniform and stable distribution, whereas Stroke exhibits boundary\-constrained and metric\-sensitive variability\. The recurrence of near\-identical metric values across branches further indicates the presence of functionally equivalent configurations, suggesting opportunities for search space reduction in future experiments\.
Full visualizations supporting these observations are provided in the Pima Supplementary Materials and Stroke Supplementary Materials.
#### 4\.4\.2Pipeline\-Level and Aggregate Performance Analysis
Analysis of the top-ranked pipelines on the Pima dataset (Table [Supplementary B.3](https://arxiv.org/html/2605.21528#Ax1.T3)) reveals distinct optimization characteristics and structural properties of the search space. Full visualizations for both the Pima and Stroke datasets are provided in the Pima and Stroke Supplementary Materials.
##### Pima Dataset\.
Top-performing pipelines exhibit strong convergence across metrics, with Accuracy, Micro-F1, and Weighted-F1 consistently around 0.8831, and Macro-F1 up to 0.8764.
High-performing configurations are dominated by XGBoost and SVM. The best Macro Precision (0.9032) is achieved by an SVM pipeline (4 + infgain + gaussian_noise + noImbl + standard, Split = 0.1, P_th = 0.35), while the best Macro Recall (0.8930) and Macro-F1 (0.8764) are obtained by XGBoost pipelines using 6 + biMean/biMax Infgain + noAug + TomekLinks + standard with the same Split and P_th. Micro and Weighted metrics (approximately 0.8831–0.8850) are also dominated by these XGBoost configurations, though SVM pipelines achieve identical values in several cases.
Across top-ranked pipelines, consistent patterns emerge: Split = 0.1 and P_th = 0.35 are dominant, biMean/biMax Infgain is frequently used, and imbalance handling (noImbl, TomekLinks) shows limited impact. Augmentation differs by model, with noAug common in XGBoost and mixup in SVM.
##### Stroke Dataset\.
Top-ranked pipelines on the Stroke dataset achieve high majority-class performance, with Accuracy and Micro-F1 of 0.9534–0.9537 and Weighted-F1 up to 0.9424, but substantially lower Macro-F1 (0.6511–0.6560).
XGBoost pipelines (e.g., 6 + infgain + noAUG + noIMB + Standard) achieve the best Macro Precision (0.9765), typically with Split = 0.1 and P_th = 0.5. In contrast, Logistic Regression with imbalance handling (e.g., SMOTE, ADASYN) achieves higher Macro Recall (0.7800) and Macro-F1 (0.6560), usually with Split = 0.2 and P_th = 0.5.
Micro-F1 and Accuracy (approximately 0.9537) are consistently obtained by Gradient Boosting or Logistic Regression pipelines with mixup + noIMB, often using P_th = 0.35. Weighted-F1 (0.9399–0.9424) is dominated by XGBoost without imbalance handling.
No single optimal configuration emerges: the best Split (0.1 or 0.2) and P_th (0.35 or 0.5) vary by metric.
Aggregate results in the Pima and Stroke Supplementary Materials show clear differences between the two datasets on the same metrics. For Macro-F1, Pima achieves 0.6550 (Std 0.0920) versus 0.5116 (Std 0.0482) for Stroke. For Weighted-F1, Pima reaches 0.6707 (Std 0.1011), while Stroke is substantially higher at 0.8511 (Std 0.0964). Accuracy is also higher but more variable in Stroke (0.8181, Std 0.1450) compared to Pima (0.6736, Std 0.0897), indicating stronger majority-class dominance and greater variability.
These differences are consistent with the behavior of top-ranked pipelines. In Pima, top configurations achieve high Macro-F1 (up to 0.8764), creating a wider performance gap relative to the overall mean and resulting in higher variance (Std 0.0920). In contrast, Stroke shows consistently lower Macro-F1 (mean 0.5116, best approximately 0.6560), leading to a more compressed distribution and lower variance (Std 0.0482). This suggests that pipeline configurations have a stronger impact on class-balanced performance in Pima, whereas in Stroke, performance is more constrained by class imbalance.
Table 2: Top-ranked pipelines under different evaluation metrics on the Pima Indians Diabetes dataset. P_th represents probability threshold. M_ for Macro, m_ for Micro, W_ for Weighted metrics, R_S for Random Seed.

| Rank | Metric | Value | LogDir | Split_Ratio | P_th | R_S |
|---|---|---|---|---|---|---|
| 1 | M_P | 0.9032 | `4/infgain/gaussian_noise__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 23 |
| 1 | M_R | 0.8930 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | M_F1 | 0.8764 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | m_P | 0.8831 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | m_R | 0.8831 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | m_F1 | 0.8831 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 1 | W_P | 0.8944 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | W_R | 0.8831 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | W_F1 | 0.8850 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | Acc | 0.8831 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 2 | M_P | 0.8906 | `6/biMeanInfgain/standard__gaussian_noise__noImbl/LogisticRegression` | 0.1 | 0.5 | 7 |
| 2 | M_R | 0.8930 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 2 | M_F1 | 0.8764 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 2 | m_P | 0.8831 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 2 | m_R | 0.8831 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 2 | m_F1 | 0.8831 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 2 | W_P | 0.8944 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 2 | W_R | 0.8831 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 2 | W_F1 | 0.8850 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 2 | Acc | 0.8831 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 3 | M_P | 0.8906 | `5/biMeanInfgain/minmax__gaussian_noise__noImbl/sklearn_SVM` | 0.1 | 0.5 | 126 |
| 3 | M_R | 0.8830 | `5/biMaxInfgain/noAug__noImbl__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 3 | M_F1 | 0.8727 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 3 | m_P | 0.8831 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 3 | m_R | 0.8831 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 3 | m_F1 | 0.8831 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 3 | W_P | 0.8868 | `7/biMaxInfgain/standard__mixup__TomekLinks/random_forest` | 0.1 | 0.35 | 126 |
| 3 | W_R | 0.8831 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 3 | W_F1 | 0.8836 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 3 | Acc | 0.8831 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 4 | M_P | 0.8846 | `6/infgain/standard__gaussian_noise__noImbl/LogisticRegression` | 0.1 | 0.5 | 126 |
| 4 | M_R | 0.8830 | `5/biMeanInfgain/noAug__noImbl__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 4 | M_F1 | 0.8727 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 4 | m_P | 0.8831 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 4 | m_R | 0.8831 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 4 | m_F1 | 0.8831 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 4 | W_P | 0.8868 | `7/infgain/standard__noAug__TomekLinks/XGBmodel` | 0.1 | 0.35 | 126 |
| 4 | W_R | 0.8831 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 4 | W_F1 | 0.8836 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 4 | Acc | 0.8831 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 5 | M_P | 0.8846 | `5/biMeanInfgain/gaussian_noise__noImbl__standard/LogisticRegression` | 0.1 | 0.5 | 7 |
| 5 | M_R | 0.8830 | `5/biMeanInfgain/standard__noAug__noImbl/XGBmodel` | 0.1 | 0.35 | 126 |
| 5 | M_F1 | 0.8635 | `5/biMaxInfgain/noAug__noImbl__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 5 | m_P | 0.8701 | `4/biMaxInfgain/standard__noAug__noImbl/XGBmodel` | 0.1 | 0.5 | 126 |
| 5 | m_R | 0.8701 | `4/biMaxInfgain/standard__noAug__noImbl/XGBmodel` | 0.1 | 0.5 | 126 |
| 5 | m_F1 | 0.8701 | `4/biMeanInfgain/minmax__mixup__noImbl/sklearn_SVM` | 0.1 | 0.5 | 126 |
| 5 | W_P | 0.8855 | `5/biMeanInfgain/noAug__noImbl__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 5 | W_R | 0.8701 | `4/biMaxInfgain/standard__noAug__noImbl/XGBmodel` | 0.1 | 0.5 | 126 |
| 5 | W_F1 | 0.8725 | `5/biMaxInfgain/standard__noAug__noImbl/XGBmodel` | 0.1 | 0.35 | 126 |
| 5 | Acc | 0.8701 | `4/biMaxInfgain/standard__noAug__noImbl/XGBmodel` | 0.1 | 0.5 | 126 |

Table 3: Aggregate performance statistics across all pipeline configurations on the Pima dataset. Results are reported as mean and standard deviation over all branches.

| Metric | Mean | Std |
|---|---|---|
| Accuracy | 0.673649383024383 | 0.0897363795392836 |
| Macro_Precision | 0.6838601341830406 | 0.0691823082366285 |
| Macro_Recall | 0.6826892501253036 | 0.0755243216727676 |
| Macro_F1 | 0.6550064165406746 | 0.0919794462164404 |
| Weighted_Precision | 0.7246319526049205 | 0.0672222636465707 |
| Weighted_Recall | 0.673649383024383 | 0.0897363795392836 |
| Weighted_F1 | 0.6706759109693264 | 0.1011450473334413 |
| Micro_Precision | 0.673649383024383 | 0.0897363795392836 |
| Micro_Recall | 0.673649383024383 | 0.0897363795392836 |
| Micro_F1 | 0.673649383024383 | 0.0897363795392836 |
| Integrated_Score | 0.6774302360275797 | 0.0818401652216246 |
#### 4\.4\.3Pipeline Component Analysis: RF\-Based Importance and Component\-Specific Analysis
##### RF\-Based Component Importance \(Component Level\)\.
Random Forest importance results (Figure [Supplementary C.4](https://arxiv.org/html/2605.21528#Ax1.F4) and the Pima and Stroke Supplementary Materials) quantitatively confirm the component-level trends identified in the metric analysis.
Pima dataset: Macro-F1: AugMethod (0.454), Model (0.198), imblMethod (0.101). Macro-Precision: AugMethod (0.400), Model (0.344), imblMethod (0.101). Macro-Recall: Model (0.431), AugMethod (0.272), imblMethod (0.062).
Overall, AugMethod emerges as the most consistently influential component across Macro-F1 and Macro-Precision, while Model contributes strongly across all metrics, particularly Macro-Recall.
Stroke dataset: Macro-F1: imblMethod (0.406), NormOrder (0.153), Model (0.138), AugMethod (0.110). Macro-Precision: Model (0.215), AugMethod (0.141), imblMethod (0.100), InfoGain (0.095). Macro-Recall: imblMethod (0.685), Model (0.157).
Figure 2: Random Forest-based component importance for the Pima dataset.
##### RF Importance \(Detailed Parts Level\)\.
Detailed importance analysis (Figure [Supplementary C.5](https://arxiv.org/html/2605.21528#Ax1.F5) and the Pima and Stroke Supplementary Materials) reveals fine-grained component effects beyond aggregated metrics.
In the Pima dataset, augmentation-related methods dominate, with Gaussian noise emerging as the most influential component, while mixup and noAug show negligible influence. Model variants, such as DTmodel, also contribute substantially.
In contrast, the Stroke dataset is dominated by imbalance handling, with RandomUnderSampler as the most influential component, while other components contribute minimally.
Exact importance values are provided in the experimental logs. Random Forest importance reflects sensitivity of performance to component perturbations rather than direct performance improvement or degradation.
Figure 3: Detailed Random Forest importance analysis for the Pima dataset.
##### Component\-Specific Analysis\.
The unified heatmaps (Figure [Supplementary C.3](https://arxiv.org/html/2605.21528#Ax1.F3) and the Pima and Stroke Supplementary Materials) provide a component-level view of pipeline behavior across datasets.
In the Pima dataset, performance is influenced by multiple components in a relatively balanced manner. Gaussian-noise augmentation (gaussian_noise) consistently degrades Macro-F1 (0.58), whereas mixup and noAug achieve higher Macro-F1 (up to 0.71), indicating that no augmentation is often more effective. Model choice introduces moderate variation in Macro-F1 (e.g., sklearn_SVM 0.70 vs. DTmodel 0.62), while threshold tuning and split ratio also affect performance (Macro-F1 0.66–0.68). In contrast, scaling methods such as standard and minmax show minimal variation.
In the Stroke dataset, TomekLinks and SMOTE achieve the highest Macro-F1 (up to approximately 0.53), while some model choices (e.g., logistic regression) exhibit the lowest Macro-F1 (approximately 0.49).
Figure 4: Unified metric heatmap across branches for the Pima dataset.
#### 4\.4\.4Value\-Level Component Similarity Analysis
Table [Supplementary D.7](https://arxiv.org/html/2605.21528#Ax1.T7) summarizes the value-level similarity analysis for the Pima dataset. Ranking value-level RMS similarities reveals structured redundancy patterns across pipeline components.
##### Similarity\-Based Insights\.
Feature Dimensionality: Configurations with 4–8 selected features form a compact similarity group, indicating reduced sensitivity to the exact number of selected features within this range (RMS: 0.0342–0.0414).
Feature Selection: FS-BMax and FS-BMean exhibit highly similar behavior (RMS: 0.0252), suggesting interchangeable functionality.
Data Transformation: Mixup shows high similarity to no augmentation (RMS: 0.0279), indicating limited additional effect in the evaluated setting, whereas Gaussian noise produces larger deviations.
Imbalance Handling: TomekLinks and noIMB form a similarity pair with low RMS distance (0.0325).
Model Level: Boosting-based methods (GB and XGB) exhibit strong similarity (RMS: 0.0319), whereas tree-based and linear models show larger divergence, reflecting fundamentally different decision boundaries.
Other Components: Normalization order, probability thresholds, and data split ratios yield relatively small RMS differences, indicating limited sensitivity within the explored search space.
##### Integrated Future Experiment Configuration Strategy\.
We integrate component importance analysis with value\-level similarity and performance\-based evaluation to support principled AutoML search space reduction\.
At the component level, we assess each pipeline component’s importance based on its overall impact on performance\. Components with consistently low importance are considered for simplification\.
At the value level, two complementary criteria are applied: (1) Similarity-based pruning: groups values with low RMS differences, allowing representative configurations to be selected. (2) Performance-aware selection: prioritizes values that contribute to high-performing configurations while deprioritizing consistently weak performers.
This combined strategy reduces redundancy while preserving diversity among high\-impact components, enabling more efficient and effective AutoML search space exploration\.
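Similarity-based pruning can be sketched as grouping values whose pairwise RMS falls below a cutoff. The example uses the imbalance-handling RMS values reported for Pima in Table 4(b); the threshold of 0.035, the function name, and the single-linkage grouping via union-find are illustrative assumptions, since the text does not fix a clustering rule or cutoff.

```python
# Pairwise RMS distances for imbalance handling, transcribed from Table 4(b).
IMB_RMS = {
    ("ADA", "RUS"): 0.0435, ("ADA", "SMOTE"): 0.0383,
    ("ADA", "Tomek"): 0.0648, ("ADA", "none"): 0.0672,
    ("RUS", "SMOTE"): 0.0395, ("RUS", "Tomek"): 0.0567,
    ("RUS", "none"): 0.0595, ("SMOTE", "Tomek"): 0.0548,
    ("SMOTE", "none"): 0.0571, ("Tomek", "none"): 0.0325,
}


def similarity_groups(pair_rms, threshold):
    """Group values linked by any pair with RMS below `threshold` (single linkage)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), rms in pair_rms.items():
        root_a, root_b = find(a), find(b)
        if rms < threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for value in list(parent):
        groups.setdefault(find(value), set()).add(value)
    return sorted(sorted(g) for g in groups.values())


# The 0.035 cutoff is a hypothetical choice for illustration.
groups = similarity_groups(IMB_RMS, threshold=0.035)
```

With this cutoff only Tomek links and no imbalance handling merge into one group, so a reduced search would keep one representative of that pair plus ADASYN, random undersampling, and SMOTE. Single linkage can chain values together at looser thresholds, so the cutoff should be checked against the performance-aware criterion before values are dropped.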
Table 4:Value\-Level Similarity \(RMS\) across all pipeline components on Pima datasetAbbreviations:F denotes the number of selected features\. FS denotes feature selection methods: BMax \(biMaxInfgain\), BMean \(biMeanInfgain\), IG \(information gain\), None \(no selection\)\. Sc denotes scaling: MinMax \(min–max normalization\), Std \(standardization\)\. Aug denotes data augmentation: GN \(Gaussian noise\), Mix \(mixup\), None\. Imb denotes imbalance handling: ADA \(ADASYN\), RUS \(random undersampling\), SMOTE, TL \(Tomek links\), None\. Models include DT \(decision tree\), GB \(gradient boosting\), LR \(logistic regression\), XGB, RF \(random forest\), and SVM\. Norm denotes normalization order \(F: first, L: last\)\. Thr denotes probability threshold, and Split denotes test split ratio\.
\(a\) Feature and Feature Selection Components
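| | F:4 | F:5 | F:6 | F:7 | F:8 | FS:biMax | FS:biMean | FS:inf | FS:none |
|---|---|---|---|---|---|---|---|---|---|
| F:4 | | 0.0381 | 0.0397 | 0.0414 | 0.0405 | | | | |
| F:5 | 0.0381 | | 0.0342 | 0.0364 | 0.0388 | | | | |
| F:6 | 0.0397 | 0.0342 | | 0.0345 | 0.0373 | | | | |
| F:7 | 0.0414 | 0.0364 | 0.0345 | | 0.0355 | | | | |
| F:8 | 0.0405 | 0.0388 | 0.0373 | 0.0355 | | | | | |
| FS:biMax | | | | | | | 0.0252 | 0.0394 | 0.0405 |
| FS:biMean | | | | | | 0.0252 | | 0.0392 | 0.0411 |
| FS:inf | | | | | | 0.0394 | 0.0392 | | 0.0458 |
| FS:none | | | | | | 0.0405 | 0.0411 | 0.0458 | |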
\(b\) Data Transformation Components
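| | Sc:minmax | Sc:std | Aug:noise | Aug:mix | Aug:none | Imb:ADA | Imb:RUS | Imb:SMOTE | Imb:Tomek | Imb:none |
|---|---|---|---|---|---|---|---|---|---|---|
| Sc:minmax | | 0.0438 | | | | | | | | |
| Sc:std | 0.0438 | | | | | | | | | |
| Aug:noise | | | | 0.1045 | 0.1028 | | | | | |
| Aug:mix | | | 0.1045 | | 0.0279 | | | | | |
| Aug:none | | | 0.1028 | 0.0279 | | | | | | |
| Imb:ADA | | | | | | | 0.0435 | 0.0383 | 0.0648 | 0.0672 |
| Imb:RUS | | | | | | 0.0435 | | 0.0395 | 0.0567 | 0.0595 |
| Imb:SMOTE | | | | | | 0.0383 | 0.0395 | | 0.0548 | 0.0571 |
| Imb:Tomek | | | | | | 0.0648 | 0.0567 | 0.0548 | | 0.0325 |
| Imb:none | | | | | | 0.0672 | 0.0595 | 0.0571 | 0.0325 | |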
\(c\) Model and Decision Components
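| | M:DT | M:GB | M:LR | M:XGB | M:RF | M:SVM | Norm:first | Norm:last | Thr:.35 | Thr:.5 | Split:.1 | Split:.2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M:DT | | 0.0799 | 0.0970 | 0.0792 | 0.0736 | 0.0992 | | | | | | |
| M:GB | 0.0799 | | 0.0510 | 0.0319 | 0.0422 | 0.0510 | | | | | | |
| M:LR | 0.0970 | 0.0510 | | 0.0523 | 0.0581 | 0.0384 | | | | | | |
| M:XGB | 0.0792 | 0.0319 | 0.0523 | | 0.0419 | 0.0514 | | | | | | |
| M:RF | 0.0736 | 0.0422 | 0.0581 | 0.0419 | | 0.0596 | | | | | | |
| M:SVM | 0.0992 | 0.0510 | 0.0384 | 0.0514 | 0.0596 | | | | | | | |
| Norm:first | | | | | | | | 0.0349 | | | | |
| Norm:last | | | | | | | 0.0349 | | | | | |
| Thr:.35 | | | | | | | | | | 0.0566 | | |
| Thr:.5 | | | | | | | | | 0.0566 | | | |
| Split:.1 | | | | | | | | | | | | 0.0472 |
| Split:.2 | | | | | | | | | | | 0.0472 | |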
#### 4\.4\.5 Cross\-Component Interaction Analysis
To characterize the structure of the AutoML search space, we analyze part–part cross\-component correlation matrices on both the Pima and Stroke datasets\. These matrices quantify statistical dependencies between pipeline components by measuring the co\-variation of their induced performance across all evaluated configurations\.
Given the large number of components in the pipelines, analyzing all pairwise correlations can be overwhelming and less informative\. Therefore, we focus on selected subsets of components to highlight meaningful interactions, including those with high mean correlation, high standard deviation of correlation, and clustered groups identified from the full correlation structure\. Within\-component correlations are zero by construction due to the branch\-based pipeline design, where only one value per component is active in each configuration\. As a result, correlation analysis is meaningful only across different components\.
For the Pima dataset, Figure [5](https://arxiv.org/html/2605.21528#S4.F5) presents clustered part\-to\-part correlations, while additional correlation analyses are provided in the Pima supplementary material\.
Across components, interaction strengths vary by method group\. Decision tree\-based models \(DTModel\) show moderate\-to\-high correlations \(0\.42–0\.88\), with typical values around 0\.68\. Imbalance\-handling methods such as TomekLinks exhibit a similarly broad but higher range \(0\.54–0\.97\), with most correlations centered around 0\.66\. The no\-imbalance configuration \(noImbl\) shows similar variability \(0\.40–0\.93\), with an average level near 0\.65\. Overall, the examined components show moderate\-to\-high cross\-component correlations, indicating strong structural coupling within the pipeline design space\.
For the Stroke dataset, the corresponding correlation analyses—including top\-10 mean correlation patterns, high\-variance correlation patterns, and clustered structures—are provided in the Stroke supplementary material\.
The RandomUnderSampler exhibits correlation values ranging from 0\.74 to 0\.91, with most values concentrated around 0\.81, indicating strong interactions that span a narrower range than the Pima examples above\.
These analyses help identify groups of preprocessing, augmentation, normalization, and modeling components that tend to co\-vary in their effects\. By focusing on subsets of high\-impact or highly variable components, we reduce complexity while highlighting the most influential interactions\. This approach provides insight into synergistic and redundant pipeline structures and supports future optimization strategies, including higher\-order component interactions\.
Figure 5: Pima cluster10 part\-vs\-part correlation heat map\.
#### 4\.4\.6 Cross\-Seed Robustness Analysis
To rigorously evaluate the stability and reliability of pipeline performance under stochastic conditions, each configuration was executed across multiple random seeds, affecting data splitting, augmentation and imbalance sampling, as well as model initialization\. This experimental design enables a comprehensive robustness assessment by jointly considering predictive performance and variability\. All results in this subsection are reported for the Pima dataset only and are further detailed in the Pima Supplementary Materials\.
##### Performance and Stability Analysis Under Stochastic Variation\.
Figure [Supplementary F\.16](https://arxiv.org/html/2605.21528#Ax1.F16) illustrates the joint distribution of mean Macro\-F1 and standard deviation across models, highlighting the trade\-off between predictive performance and stability\. SVM achieves the highest Macro\-F1 \(0\.689\) but also exhibits the highest variability \($\sigma \approx 0.029$\), indicating strong sensitivity to stochastic perturbations\. In contrast, ensemble methods \(XGBoost, Gradient Boosting, Random Forest\) provide a more balanced trade\-off, achieving competitive Macro\-F1 \(0\.645–0\.659\) with moderate variability \($\sigma \approx 0.023$–$0.026$\)\. Decision Trees exhibit the lowest variability \($\sigma \approx 0.012$\), reflecting high stability but limited predictive performance \(Macro\-F1 $\approx 0.652$\)\. Logistic Regression lies in an intermediate regime \(Macro\-F1 $\approx 0.671$, $\sigma \approx 0.022$\): its mean is slightly higher than that of the ensemble methods, although the difference remains within the range of cross\-seed variability\.
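The per-model mean and standard deviation discussed above can be obtained by grouping logged scores by model and aggregating over seeds. The following is a minimal sketch under stated assumptions: the record layout `(model, seed, macro_f1)`, the function name `cross_seed_summary`, and all score values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say whether the sample or population form is reported.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical log records: (model, seed, macro_f1). Scores are illustrative only,
# not values taken from the paper's logs.
records = [
    ("SVM", 7, 0.72), ("SVM", 23, 0.66), ("SVM", 126, 0.69),
    ("DT", 7, 0.66), ("DT", 23, 0.64), ("DT", 126, 0.65),
]


def cross_seed_summary(rows):
    """Return {model: (mean score, sample standard deviation)} across seeds."""
    by_model = defaultdict(list)
    for model, _seed, score in rows:
        by_model[model].append(score)
    return {m: (mean(s), stdev(s)) for m, s in by_model.items()}


summary = cross_seed_summary(records)
for model, (mu, sigma) in sorted(summary.items()):
    print(f"{model}: mean={mu:.3f} std={sigma:.3f}")
```

Plotting the resulting (mean, std) pairs gives the kind of performance versus stability view described for the supplementary figure.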
##### Statistical Significance Under Stochastic Variation\.
Friedman tests \(Table [Supplementary F\.8](https://arxiv.org/html/2605.21528#Ax1.T8)\) show statistically significant differences across models for all evaluation metrics \($\chi^{2} \approx 26.38$–$27.90$, $p \ll 0.05$\), with a constant critical difference \(CD = 2\.53\), confirming that the observed ranking structure is consistent under stochastic variation\. The stability of test statistics across metrics further indicates a coherent and reproducible ranking pattern\. However, statistical significance does not imply equal robustness, as cross\-seed variability differs substantially across models\. Overall, ensemble methods provide the most reliable balance between performance and stability, while SVM prioritizes accuracy at the cost of robustness\.
Figure 6: Pima cross\-seed mean performance and standard deviation across random seeds\.

Table 5: Friedman test results across evaluation metrics \(computed over random seeds\)\.

| Metric | $\chi^{2}$ | $p$-value | CD |
|---|---|---|---|
| Accuracy | 27.81 | $3.97 \times 10^{-5}$ | 2.53 |
| Macro Precision | 27.90 | $3.80 \times 10^{-5}$ | 2.53 |
| Macro Recall | 27.05 | $5.58 \times 10^{-5}$ | 2.53 |
| Macro F1 | 27.81 | $3.97 \times 10^{-5}$ | 2.53 |
| Weighted Precision | 27.05 | $5.58 \times 10^{-5}$ | 2.53 |
| Weighted Recall | 27.81 | $3.97 \times 10^{-5}$ | 2.53 |
| Weighted F1 | 27.81 | $3.97 \times 10^{-5}$ | 2.53 |
| Micro Precision | 27.81 | $3.97 \times 10^{-5}$ | 2.53 |
| Micro Recall | 27.81 | $3.97 \times 10^{-5}$ | 2.53 |
| Micro F1 | 27.81 | $3.97 \times 10^{-5}$ | 2.53 |
| Integrated Score | 26.38 | $7.53 \times 10^{-5}$ | 2.53 |
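For readers who want to see how a Friedman statistic of this kind arises, the sketch below computes it from a seeds-by-models score table using the textbook formula without tie correction. It also includes the Nemenyi form of the critical difference, $CD = q_{\alpha}\sqrt{k(k+1)/(6N)}$. Two caveats apply: the section does not state which post-hoc procedure produced CD = 2.53, so the Nemenyi form is an assumption, and the score table and the choice of ten seeds below are hypothetical. The value $q_{0.05} = 2.850$ is the commonly tabulated Nemenyi constant for six models.

```python
import math


def average_ranks(scores):
    """Rank one seed's scores (1 = highest score); tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def friedman_chi2(table):
    """table[s][m] is the score of model m under seed s. Returns (chi2, mean ranks)."""
    n, k = len(table), len(table[0])
    rank_rows = [average_ranks(row) for row in table]
    mean_ranks = [sum(r[m] for r in rank_rows) / n for m in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)
    return chi2, mean_ranks


def nemenyi_cd(q_alpha, k, n):
    """Critical difference in mean rank for k models compared over n seeds."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Hypothetical scores: 4 seeds x 3 models with the same ordering under every seed.
table = [
    [0.70, 0.66, 0.64],
    [0.69, 0.65, 0.63],
    [0.71, 0.67, 0.62],
    [0.68, 0.66, 0.65],
]
chi2, mean_ranks = friedman_chi2(table)
cd = nemenyi_cd(q_alpha=2.850, k=6, n=10)  # n = 10 seeds is a hypothetical choice
print(chi2, mean_ranks, cd)
```

When every seed yields the same ordering, the statistic reaches its maximum $N(k-1)$, which is 8 for this toy table. Two models are then considered distinguishable when their mean ranks differ by more than the critical difference.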
## 5 Discussion
The experimental results demonstrate that the proposed yvsoucom\-iterkit framework enables systematic, large\-scale, and reproducible benchmarking of tabular data prediction pipelines\. By explicitly exposing the configuration space across preprocessing, feature selection, augmentation, and model design, the framework not only identifies which pipelines perform best, but also provides insights into the factors driving performance through interpretable, log\-driven analysis\.
### 5\.1 Distinctive Features of the Proposed Framework
A key contribution of this work is the design of the yvsoucom\-iterkit framework, which fundamentally shifts the AutoML paradigm\. Unlike conventional AutoML approaches that treat pipeline optimization as a black\-box search problem, the proposed framework models each configuration as a deterministic, traceable log entity\. This log\-centric approach ensures full transparency and reproducibility by maintaining a complete record of every experiment, configuration, and outcome\.
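To make the idea of a configuration as a traceable log entity concrete, here is a minimal sketch. The class `PipelineConfig`, its field names, and the hashed `config_id` are hypothetical and are not the framework's API. The directory layout is inferred from the LogDir strings in Supplementary Table B.3 (for example `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel`), and placing the scaler first or last according to normalization order is likewise an inference from those strings.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical, immutable description of one pipeline configuration."""
    n_features: int
    feature_selection: str
    scaler: str
    augmentation: str
    imbalance: str
    model: str
    norm_first: bool
    split_ratio: float
    prob_threshold: float
    seed: int

    def log_dir(self) -> str:
        # Assumed layout: <features>/<selector>/<ordered steps>/<model>,
        # with the scaler placed first or last depending on normalization order.
        steps = [self.augmentation, self.imbalance]
        steps = [self.scaler] + steps if self.norm_first else steps + [self.scaler]
        return f"{self.n_features}/{self.feature_selection}/{'__'.join(steps)}/{self.model}"

    def config_id(self) -> str:
        # Deterministic identifier: the same fields always hash to the same id.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


cfg = PipelineConfig(6, "biMaxInfgain", "standard", "noAug", "TomekLinks",
                     "XGBmodel", False, 0.1, 0.35, 126)
print(cfg.log_dir())
print(cfg.config_id())
```

Because the identifier depends only on the configuration fields, rerunning the same configuration maps to the same log location, which is the property that makes later attribution and cross-seed comparison possible.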
By embracing the full configuration space, the framework enables structured and exhaustive exploration across all components, from raw data manipulation to final model outputs, and shows how configurations interact at each stage of the pipeline\. Instead of focusing solely on optimizing a single performance metric, yvsoucom\-iterkit facilitates a full\-space search, providing both performance metrics and detailed analysis of how individual components contribute to the pipeline’s effectiveness\. Users can thereby examine how choices of preprocessing techniques, feature representations, and models affect performance\.
The framework’s ability to trace results from raw data through to component\-level outcomes is a critical feature\. This capability allows users to conduct detailed, log\-driven analysis, examining both the overall pipeline and individual components\. By tracking how configuration settings impact each stage of the pipeline, users can gain valuable insights into the specific steps that drive improvements, ultimately leading to better model performance and more effective configurations\.
Additionally, the framework supports multi\-run and multi\-seed experimentation, enabling a comprehensive evaluation of performance variability and stability\. This approach shifts AutoML from a single\-run optimization paradigm to a statistically grounded experimental methodology, where conclusions are derived from aggregated evidence rather than isolated results\. By accounting for stochastic fluctuations, this methodology reduces the risk that conclusions reflect a single favorable seed, offering a more reliable basis for deploying models in real\-world, dynamic environments\.
A key advantage of this framework is its ability to explore the full configuration space, encompassing various components such as feature selection, augmentation, imbalance handling, and model choice\. This flexibility enables users to adjust settings at both the high level \(e\.g\., model selection\) and the component level \(e\.g\., preprocessing techniques\), allowing them to observe the impact of these adjustments on performance\. By experimenting with different configurations across the entire space, users gain deep insights into the specific components that drive optimal pipeline performance\.
### 5\.2 Findings and Implications from Two Datasets: Insights Using the yvsoucom\-iterkit Framework
The results provide critical insights into the key performance drivers, component importance, and robustness characteristics of pipelines constructed using the yvsoucom\-iterkit framework\.
##### Top Performance\.
Analysis of raw metric distributions and branch\-level behavior reveals repetitive clustering patterns across pipelines, where multiple configurations produce near\-identical outcomes, indicating a redundant configuration space with many\-to\-one mappings\. The Pima dataset exhibits a more stable and low\-sensitivity landscape, while the Stroke dataset demonstrates stronger boundary\-constrained and imbalance\-sensitive fluctuations, together suggesting functionally equivalent configurations and a reducible search space\.
On the Pima dataset, top pipelines achieve Macro\-F1 up to 0\.8764, indicating relatively balanced class performance\. XGBoost and SVM dominate top rankings, where SVM achieves strong peak performance but exhibits higher variability across configurations and random seeds, while ensemble methods show greater robustness\.
In contrast, the Stroke dataset exhibits high Accuracy and Micro\-F1 \(0\.9534–0\.9537\) and Weighted\-F1 up to 0\.9424, but significantly lower Macro\-F1 \(0\.6511–0\.6560\), reflecting stronger class imbalance effects\.
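The gap between Weighted\-F1 and Macro\-F1 follows directly from how the two averages weight the classes. The sketch below works through a hypothetical binary confusion matrix with roughly 5% minority samples. The counts are invented for illustration and are not taken from the paper's logs; only the definitions of F1, macro averaging, and support-weighted averaging are standard.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_weighted_f1(tn, fp, fn, tp):
    """Macro- and support-weighted F1 for a binary problem (class 1 = minority)."""
    f1_pos = f1(tp, fp, fn)
    # For class 0, the roles swap: its hits are tn, its false alarms are fn,
    # and its misses are fp.
    f1_neg = f1(tn, fn, fp)
    n_neg, n_pos = tn + fp, fn + tp
    macro = (f1_neg + f1_pos) / 2
    weighted = (n_neg * f1_neg + n_pos * f1_pos) / (n_neg + n_pos)
    return macro, weighted


# Hypothetical test split: 486 majority and 25 minority samples.
macro, weighted = macro_and_weighted_f1(tn=484, fp=2, fn=23, tp=2)
print(f"macro-F1={macro:.3f} weighted-F1={weighted:.3f}")
```

Because the majority class dominates the support, a classifier that misses most minority cases can still score above 0.9 on Weighted\-F1 while its Macro\-F1 stays below 0.6, which is the pattern reported for Stroke.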
A notable finding is that configurations effective on the Pima dataset do not directly transfer to Stroke\. Pima shows more stable and uniform behavior across experimental configurations, whereas Stroke requires adjusted configuration settings to achieve better performance, particularly due to higher sensitivity and class imbalance\.
##### Component Importance and Correlations\.
Random Forest\-based importance analysis reveals clear component\-level drivers of performance\. On the Pima dataset, augmentation \(AugMethod, importance 0\.454 for Macro\-F1\), model choice \(0\.198\), and imbalance handling \(0\.101\) are the most influential factors, indicating that performance is primarily governed by data transformation and model selection\. On the Stroke dataset, imbalance handling dominates \(0\.406\), followed by NormOrder, model choice, and augmentation\.
Fine\-grained analysis confirms that augmentation strategies \(e\.g\., Gaussian noise vs\. mixup or no augmentation\) induce substantial performance variation on the Pima dataset, with Gaussian noise consistently degrading performance\.
Component similarity analysis on the Pima dataset further reveals strong redundancy in the configuration space\. Feature selection variants \(biMax–biMean\) exhibit a low RMS distance \(0\.0252\), mixup shows high similarity to no augmentation \(RMS: 0\.0279\), and TomekLinks and no\-imbalance are closely clustered \(0\.0325\), indicating interchangeable behaviors\. In contrast, augmentation \(Gaussian noise vs\. no augmentation ≈ 0\.10\) and model families show higher divergence, reflecting their stronger functional impact\.
Cross\-component correlation analysis confirms substantial structural coupling across the pipeline space\. On the Pima dataset, decision tree–based models \(DTModel\) exhibit moderate\-to\-high correlations \(≈0\.4–0\.9\), with typical values around 0\.68\. For the Stroke dataset, RandomUnderSampler shows correlations ranging from 0\.74 to 0\.91, concentrated around 0\.81, indicating strong interactions within a narrower range than the Pima example\.
These results indicate that performance is governed by a small subset of interacting components—primarily model choice, augmentation strategy, and imbalance handling—while many preprocessing configurations are functionally redundant, thereby guiding the design of future experiment configuration settings toward focused, interaction\-aware optimization rather than independent tuning\.
##### Performance–Variability and Robustness\.
Cross\-seed analysis on the Pima dataset reveals a clear trade\-off between predictive performance and robustness\. SVM achieves the highest mean Macro\-F1 \(0\.689\) but also exhibits the largest variability \($\sigma \approx 0.029$\), indicating strong sensitivity to stochastic factors\. In contrast, ensemble methods \(XGBoost, Gradient Boosting, Random Forest\) achieve slightly lower but competitive performance \(0\.645–0\.659\) with lower variability \($\sigma \approx 0.023$–$0.026$\), providing a better balance between accuracy and robustness\. Decision Trees show the lowest variability \($\sigma \approx 0.012$\) but reduced performance, while Logistic Regression occupies an intermediate regime between SVM and ensemble methods\.
Friedman tests \($\chi^{2} \approx 26$–$28$, $p \ll 0.05$\) confirm statistically significant and consistent ranking differences across models, although robustness varies substantially\. This observation reflects a trade\-off between predictive performance and stability, where high\-capacity models such as SVM achieve strong predictive performance but exhibit higher sensitivity to stochastic perturbations, while ensemble methods improve stability through aggregation\. These results highlight the importance of multi\-seed evaluation and indicate that reliable model selection should balance performance and stability rather than relying on single\-seed outcomes\.
##### Search Space Structure, Sensitivity, and Implications for Future Experiment Design\.
The framework’s search space exhibits two distinct regions: a redundant region with highly correlated, interchangeable configurations, and a sensitive region where key components \(e\.g\., model choice and imbalance handling\) significantly affect performance\. This indicates that the effective dimensionality is much lower than the combinatorial space\.
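A rough arithmetic sketch shows how merging redundant values shrinks the combinatorial space. The value counts below are read off the labels in Table 4 for Pima. The full cross-product is an upper bound on a single-seed grid and need not equal the number of configurations actually run, and the three merges are hypothetical choices motivated by the low RMS distances reported for biMax and biMean, mixup and no augmentation, and TomekLinks and no imbalance handling.

```python
from math import prod

# Number of distinct values per component, counted from the labels in Table 4 (Pima).
GRID = {"F": 5, "FS": 4, "Sc": 2, "Aug": 3, "Imb": 5,
        "Model": 6, "Norm": 2, "Thr": 2, "Split": 2}

# Hypothetical pruning: keep one representative for each near-duplicate pair.
MERGED = dict(GRID, FS=3, Aug=2, Imb=4)

full = prod(GRID.values())       # cross-product of all listed values
reduced = prod(MERGED.values())  # cross-product after the three merges
print(full, reduced, reduced / full)
```

Three single-value merges already cut this illustrative grid to 40% of its original size, because each removed value eliminates every configuration that contained it.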
Practically, ensemble methods provide the best balance of performance and robustness, while imbalance handling is critical for minority\-class prediction\. Feature richness should be preserved, and excessive preprocessing offers limited gains compared to model and imbalance choices\. Multi\-seed evaluation is essential to ensure stability and generalizability\.
Due to computational constraints, the current framework does not yet include neural networks, a wider range of augmentation and imbalance strategies, or more diverse feature selection methods\. While strong results are achieved on the Pima dataset, the Stroke dataset remains challenging, with substantially lower Macro\-F1 due to severe class imbalance and higher sensitivity\.
These findings provide clear guidance for future experiment design, emphasizing targeted expansion of high\-impact components \(e\.g\., imbalance handling and model diversity\) and more efficient exploration of the reduced effective search space, thereby supporting more reliable deployment in dynamic real\-world environments\.
### 5\.3 Limitations and Future Work
This study is limited to two datasets, and additional datasets are required to establish broader generalizability\. Neural network models were not evaluated due to computational constraints, although future work may incorporate optimized architectures, particularly in data\-rich settings\.
Future work will focus on:
- Extending the framework to multi\-modal and unstructured data, including imaging, video, and text\.
- Enabling multi\-node parallel execution to scale pipeline evaluation across larger datasets\.
- Exploring improved configuration strategies for challenging datasets \(e\.g\., Stroke\), with emphasis on imbalance handling\.
- Conducting deployment\-oriented evaluations to assess real\-world clinical applicability\.
These directions support the development of a scalable, reproducible, and structure\-aware framework capable of efficiently identifying robust pipelines while maintaining transparency and interpretability across diverse biomedical datasets\.
## 6 Conclusion
This study introduces yvsoucom\-iterkit, a deterministic and log\-driven AutoML framework that fundamentally redefines pipeline optimization as a fully reproducible, configuration\-level experimental system\. Unlike conventional black\-box AutoML approaches, the framework explicitly encodes each pipeline configuration as a traceable log entity, enabling complete transparency, fine\-grained component\-level analysis, and statistically grounded evaluation\.
By exposing the full preprocessing–model configuration space, the framework enables exhaustive yet structured exploration of pipeline interactions across feature selection, augmentation, imbalance handling, and model design\. This design transforms AutoML from a single\-objective optimization process into a systematic experimental paradigm for analyzing component behavior and interactions\.
Extensive evaluation over more than 18,000 pipeline configurations reveals a highly redundant search space, where many configurations converge to similar performance, indicating a substantially reduced effective dimensionality\. Across both datasets, performance is primarily governed by a small subset of interacting components, particularly model selection, imbalance handling, and data augmentation, while most preprocessing operations contribute marginal gains\.
Raw metric distributions and branch\-level behavior show the same pattern: many configurations produce near\-identical outcomes \(a many\-to\-one mapping\), with a stable, low\-sensitivity landscape on Pima and stronger boundary\-constrained, imbalance\-sensitive fluctuations on Stroke\.
On the Pima dataset, top pipelines achieve Macro\-F1 up to 0\.8764, with ensemble models such as XGBoost showing strong and stable performance, while SVM exhibits higher variability across seeds \($\sigma \approx 0.029$\)\. In contrast, the Stroke dataset achieves high Micro\-F1 \(0\.9534–0\.9537\) and Weighted\-F1 up to 0\.9424 but substantially lower Macro\-F1 \(0\.6511–0\.6560\), reflecting severe class imbalance effects\.
Component similarity analysis on the Pima dataset reveals strong redundancy in the configuration space\. Feature selection variants \(biMax–biMean\) exhibit a low RMS distance \(0\.0252\), mixup is highly similar to no augmentation \(0\.0279\), and TomekLinks closely aligns with no imbalance handling \(0\.0325\), indicating largely interchangeable behaviors\. In contrast, augmentation strategies \(e\.g\., Gaussian noise vs\. no augmentation $\approx 0.10$\) and model families show greater divergence, reflecting their stronger functional impact on performance\.
Cross\-seed analysis further confirms a clear performance–robustness trade\-off, where ensemble methods show lower variability \($\sigma \approx 0.023$–$0.026$\) than SVM, which achieves higher peak performance but lower stability\. Statistical testing \(Friedman $\chi^{2} \approx 26$–$28$, $p \ll 0.05$\) validates significant and consistent model ranking differences across experimental runs\.
Component\-level analysis shows that augmentation \(AugMethod\), model choice, and imbalance handling are the dominant drivers of performance on the Pima dataset \(importance scores for Macro\-F1 of 0\.454, 0\.198, and 0\.101, respectively\), while imbalance handling is most critical for the Stroke dataset \(0\.406\)\. These results are directly enabled by the framework’s log\-level traceability, which allows decomposition of performance into interpretable configuration components\.
Future work will extend the framework to neural architectures, multimodal datasets, and distributed execution, while enhancing imbalance\-aware and interaction\-aware optimization for real\-world clinical applications\.
## Acknowledgements
The authors declare that this work received no funding or financial support\. Hangzhou Domain Zones Technology Co\. Ltd\. had no involvement in the study design, data collection, analysis, interpretation of results, or manuscript preparation and submission\.
## Data and Code Availability
Example scripts, including all pipeline configurations, sample data, and experiment logs required to reproduce the experiments, are publicly accessible at:[https://github\.com/yvsoucom/itekit\-examples](https://github.com/yvsoucom/itekit-examples)\.
## Statement for studies involving humans and animals
This study does not involve any direct experimentation on humans or animals\. The datasets used are publicly available and fully anonymized; therefore, neither ethical approval nor informed consent was required\.
## Declaration of Competing Interest
One of the authors is affiliated with Hangzhou Domain Zones Technology Co\. Ltd\. The company had no role in the study design, data collection, analysis, interpretation of data, or in writing the manuscript\.
The authors declare that they have no other competing financial or non\-financial interests, no additional support beyond their primary affiliations, and no other activities or relationships that could be perceived to have influenced the submitted work\.
## Author Contributions
Rui Huang: Conceptualization, Methodology, Formal analysis, Validation, Investigation, Data curation, Visualization, Writing – original draft, review, and editing\.
Lican Huang: Methodology, Software, Formal analysis, Investigation, Data curation, Writing – original draft, review, and editing\.
All authors read and approved the final manuscript\.
## 7 Declaration of generative AI and AI\-assisted technologies in the manuscript preparation process
Statement: During the preparation of this work, the authors used ChatGPT to help prepare and write the draft manuscript\. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article\.
## References
## Supplementary Materials Overview
This document presents the comprehensive set of experimental results, analyses, and visualizations that support the findings reported in the main manuscript\. It includes detailed performance metrics, branch\-level trends, cross\-component interactions, cross\-seed variability, and value\-level component similarity analyses for both the Pima and Stroke datasets\.
The Supplementary Materials provide tabular summaries, heatmaps, random\-forest\-based component importance analyses, correlation diagrams, and statistical comparisons, offering a complete view of pipeline behavior, component contributions, and model robustness across multiple experimental configurations\.
These results enable a deeper understanding of pipeline design, cross\-dataset comparisons, and the impact of preprocessing, data augmentation, imbalance handling, and model selection on predictive performance\. They also allow readers to trace the effects of individual pipeline components, examine branch\-level variability, and evaluate model stability across different random seeds, providing a solid foundation for reproducible machine learning research\.
This PDF contains additional figures, tables, and analyses supporting the main manuscript\. Main figures and tables are reproduced here for completeness\.
### Supplementary A Raw Metric Distributions and Branch\-Level Trends
This subsection presents the raw metric distributions for individual branches of the Pima and Stroke datasets\. It includes top class\-wise F1 scores for the Pima dataset \(Table [Supplementary A\.1](https://arxiv.org/html/2605.21528#Ax1.T1)\), top class\-wise F1 scores for the Stroke dataset \(Table [Supplementary A\.2](https://arxiv.org/html/2605.21528#Ax1.T2)\), and visual distributions of metrics across all branches \(Figures [Supplementary A\.1](https://arxiv.org/html/2605.21528#Ax1.F1) and [Supplementary A\.2](https://arxiv.org/html/2605.21528#Ax1.F2)\)\.
These results provide insight into branch\-level performance variation, demonstrate the impact of different preprocessing, augmentation, and imbalance\-handling configurations, and highlight the best\-performing branches across multiple experimental setups\.
##### Abbreviations\.
C = class label \(0 = majority, 1 = minority\); Feat = number of features; FS = feature selection \(bMax = biMaxInfgain, bMean = biMeanInfgain, Inf = infgain\); Sc = scaler \(Std = Standard, MinMax = MinMax\); Aug = augmentation \(MX = mixup, GN = Gaussian noise, NA = none\); Imb = imbalance handling \(nIMB = none, Tomek = TomekLinks\); Norm = normalization order \(L = last, F = first\); Split = train/test ratio; Prob = probability threshold; Seed = random seed; Acc = accuracy; W\-F1 = weighted F1; M\-F1 = macro F1\.

Table Supplementary A\.1: Top 5 Class\-wise F1 Scores on Pima Dataset

| F1 | C | RunID | Feat | FS | Sc | Aug | Imb | Model | Acc | W-F1 | M-F1 | Norm | Split | Prob | Seed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.909 | 0 | 0754 | 4 | bMax | Std | MX | nIMB | SVM | 0.883 | 0.884 | 0.873 | L | 0.1 | 0.35 | 126 |
| 0.909 | 0 | 0754 | 4 | bMean | Std | MX | nIMB | SVM | 0.883 | 0.884 | 0.873 | F | 0.1 | 0.35 | 126 |
| 0.906 | 0 | 1919 | 5 | bMean | Std | MX | nIMB | RF | 0.870 | 0.866 | 0.849 | L | 0.1 | 0.50 | 70 |
| 0.906 | 0 | 0754 | 4 | bMean | Std | MX | nIMB | GB | 0.870 | 0.866 | 0.849 | L | 0.1 | 0.50 | 126 |
| 0.905 | 0 | 1411 | 6 | bMax | Std | NA | Tomek | XGB | 0.883 | 0.885 | 0.876 | L | 0.1 | 0.35 | 126 |

Table Supplementary A\.2: Top 5 Class\-wise F1 Scores on Stroke Dataset

| F1 | C | RunID | Feat | FS | Sc | Aug | Imb | Model | Acc | W-F1 | M-F1 | Norm | Split | Prob |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.978 | 0 | 8f23 | 7 | Inf | Min | MX | Tomek | GB | 0.957 | 0.943 | 0.622 | T | 0.1 | 0.35 |
| 0.977 | 0 | 8f23 | 9 | bMax | Min | GN | Tomek | RF | 0.955 | 0.936 | 0.563 | F | 0.1 | 0.50 |
| 0.977 | 0 | 2736 | 8 | bMean | Std | GN | nIMB | LR | 0.955 | 0.936 | 0.563 | F | 0.1 | 0.35 |
| 0.977 | 0 | 2736 | 9 | bMax | Min | GN | Tomek | XGB | 0.955 | 0.936 | 0.563 | T | 0.1 | 0.35 |
| 0.977 | 0 | 8f23 | 8 | bMax | Std | MX | nIMB | XGB | 0.955 | 0.936 | 0.563 | F | 0.1 | 0.50 |

Figure Supplementary A\.1: Per\-branch metric distribution for the Pima dataset\.

Figure Supplementary A\.2: Per\-branch metric distribution for the Stroke dataset\.
### Supplementary BPipeline\-Level and Aggregate Performance Analysis
This subsection provides a detailed overview of the performance of different pipeline configurations evaluated on the Stroke and Pima datasets\. We present both top\-ranked pipelines for each evaluation metric \(Tables [Supplementary B\.3](https://arxiv.org/html/2605.21528#Ax1.T3) and [Supplementary B\.4](https://arxiv.org/html/2605.21528#Ax1.T4)\) and aggregate performance statistics across all pipelines and branches \(Tables [Supplementary B\.5](https://arxiv.org/html/2605.21528#Ax1.T5) and [Supplementary B\.6](https://arxiv.org/html/2605.21528#Ax1.T6)\)\.
The top\-ranked pipelines highlight which combinations of preprocessing, feature selection, normalization, augmentation, imbalance\-handling, and model selection yield the best performance under metrics such as Accuracy, Macro/Micro/Weighted Precision, Recall, F1, and the Integrated Score\. By analyzing these pipelines, readers can identify patterns in high\-performing configurations, including how normalization order, probability thresholds, and train\-test splits influence predictive outcomes\.
The aggregate statistics complement the top\-ranked pipelines by providing overall trends and variability across all configurations\. This includes mean and standard deviation for each metric, allowing for assessment of pipeline robustness and the stability of performance across branches and experimental settings\.
Together, these results provide a comprehensive view of pipeline behavior, enabling a deep understanding of design choices, cross\-dataset comparisons, and the impact of different preprocessing, augmentation, and modeling strategies on predictive performance\.
Table Supplementary B\.3: Top\-ranked pipelines under different evaluation metrics on the Pima Indians Diabetes dataset\. P\_th denotes the probability threshold; M\_, m\_, and W\_ denote Macro, Micro, and Weighted metrics; R\_S denotes the random seed\.

| Rank | Metric | Value | LogDir | Split_Ratio | P_th | R_S |
|---|---|---|---|---|---|---|
| 1 | M_P | 0.9032 | `4/infgain/gaussian_noise__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 23 |
| 1 | M_R | 0.8930 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | M_F1 | 0.8764 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | m_P | 0.8831 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | m_R | 0.8831 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | m_F1 | 0.8831 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 1 | W_P | 0.8944 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | W_R | 0.8831 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | W_F1 | 0.8850 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 1 | Acc | 0.8831 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 2 | M_P | 0.8906 | `6/biMeanInfgain/standard__gaussian_noise__noImbl/LogisticRegression` | 0.1 | 0.5 | 7 |
| 2 | M_R | 0.8930 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 2 | M_F1 | 0.8764 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 2 | m_P | 0.8831 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 2 | m_R | 0.8831 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 2 | m_F1 | 0.8831 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 2 | W_P | 0.8944 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 2 | W_R | 0.8831 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 2 | W_F1 | 0.8850 | `6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 2 | Acc | 0.8831 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 3 | M_P | 0.8906 | `5/biMeanInfgain/minmax__gaussian_noise__noImbl/sklearn_SVM` | 0.1 | 0.5 | 126 |
| 3 | M_R | 0.8830 | `5/biMaxInfgain/noAug__noImbl__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 3 | M_F1 | 0.8727 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 3 | m_P | 0.8831 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 3 | m_R | 0.8831 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 3 | m_F1 | 0.8831 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 3 | W_P | 0.8868 | `7/biMaxInfgain/standard__mixup__TomekLinks/random_forest` | 0.1 | 0.35 | 126 |
| 3 | W_R | 0.8831 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 3 | W_F1 | 0.8836 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 3 | Acc | 0.8831 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 4 | M_P | 0.8846 | `6/infgain/standard__gaussian_noise__noImbl/LogisticRegression` | 0.1 | 0.5 | 126 |
| 4 | M_R | 0.8830 | `5/biMeanInfgain/noAug__noImbl__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 4 | M_F1 | 0.8727 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 4 | m_P | 0.8831 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 4 | m_R | 0.8831 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 4 | m_F1 | 0.8831 | `6/biMeanInfgain/noAug__TomekLinks__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 4 | W_P | 0.8868 | `7/infgain/standard__noAug__TomekLinks/XGBmodel` | 0.1 | 0.35 | 126 |
| 4 | W_R | 0.8831 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 4 | W_F1 | 0.8836 | `4/biMaxInfgain/mixup__noImbl__standard/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 4 | Acc | 0.8831 | `4/biMeanInfgain/standard__mixup__noImbl/sklearn_SVM` | 0.1 | 0.35 | 126 |
| 5 | M_P | 0.8846 | `5/biMeanInfgain/gaussian_noise__noImbl__standard/LogisticRegression` | 0.1 | 0.5 | 7 |
| 5 | M_R | 0.8830 | `5/biMeanInfgain/standard__noAug__noImbl/XGBmodel` | 0.1 | 0.35 | 126 |
| 5 | M_F1 | 0.8635 | `5/biMaxInfgain/noAug__noImbl__standard/XGBmodel` | 0.1 | 0.35 | 126 |
| 5 | m_P | 0.8701 | `4/biMaxInfgain/standard__noAug__noImbl/XGBmodel` | 0.1 | 0.5 | 126 |
| 5 | m_R | 0.8701 | `4/biMaxInfgain/standard__noAug__noImbl/XGBmodel` | 0.1 | 0.5 | 126 |
| 5 | m_F1 | 0.8701 | `4/biMeanInfgain/minmax__mixup__noImbl/sklearn_SVM` | 0.1 | 0.5 | 126 |
| 5 | W_P | 0.8855 | `5/biMeanInfgain/noAug__noImbl__standard/XGBmodel` | 0.1 | 0.35 | |
126W\_R0\.87014/biMaxInfgain/standard\_\_noAug\_\_noImbl/XGBmodel0\.10\.5126W\_F10\.87255/biMaxInfgain/standard\_\_noAug\_\_noImbl/XGBmodel0\.10\.35126Acc0\.87014/biMaxInfgain/standard\_\_noAug\_\_noImbl/XGBmodel0\.10\.5126Table Supplementary B\.4:Top\-ranked pipelines under different evaluation metrics on the Stroke prediction dataset\. P\_th represents probability threshold\. M\_ for Macro, m\_ for Micro, W\_ for Weighted metricsRankMetricValueLogDirNorm\_FirstSplit\_RatioP\_th1M\_P0\.97656/infgain/noAUG\_\_noIMB\_\_Standard/XGBmodelFalse0\.10\.5M\_R0\.78008/infgain/Standard\_\_mixup\_\_SMOTE/LogisticRegressionTrue0\.20\.5M\_F10\.65609/biMaxInfgain/noAUG\_\_SMOTE\_\_MinMax/LogisticRegressionFalse0\.20\.5m\_P0\.95378/infgain/MinMax\_\_mixup\_\_noIMB/GradientBoostingTrue0\.10\.35m\_R0\.95378/infgain/MinMax\_\_mixup\_\_noIMB/GradientBoostingTrue0\.10\.35m\_F10\.95378/infgain/MinMax\_\_mixup\_\_noIMB/GradientBoostingTrue0\.10\.35W\_P0\.955210/noSelect/noAUG\_\_TomekLinks\_\_Standard/XGBmodelFalse0\.10\.5W\_R0\.95378/infgain/MinMax\_\_mixup\_\_noIMB/GradientBoostingTrue0\.10\.35W\_F10\.942410/noSelect/noAUG\_\_noIMB\_\_MinMax/XGBmodelFalse0\.10\.35Acc0\.95378/infgain/MinMax\_\_mixup\_\_noIMB/GradientBoostingTrue0\.10\.352M\_P0\.97659/infgain/MinMax\_\_noAUG\_\_TomekLinks/LogisticRegressionTrue0\.20\.5M\_R0\.77927/infgain/Standard\_\_mixup\_\_RandomUnderSampler/LogisticRegressionTrue0\.20\.5M\_F10\.65609/biMaxInfgain/noAUG\_\_SMOTE\_\_Standard/LogisticRegressionFalse0\.20\.5m\_P0\.95349/infgain/MinMax\_\_mixup\_\_noIMB/LogisticRegressionTrue0\.20\.35m\_R0\.95349/infgain/MinMax\_\_mixup\_\_noIMB/LogisticRegressionTrue0\.20\.35m\_F10\.95347/infgain/mixup\_\_noIMB\_\_MinMax/LogisticRegressionFalse0\.20\.35W\_P0\.95529/infgain/MinMax\_\_noAUG\_\_TomekLinks/LogisticRegressionTrue0\.20\.5W\_R0\.95347/infgain/mixup\_\_noIMB\_\_MinMax/LogisticRegressionFalse0\.20\.35W\_F10\.942410/noSelect/MinMax\_\_noAUG\_\_noIMB/XGBmodelTrue0\.10\.35Acc0\.95349/infgain/MinMax\_\_mixup\_\_noIM
B/LogisticRegressionTrue0\.20\.353M\_P0\.97656/infgain/MinMax\_\_noAUG\_\_noIMB/XGBmodelTrue0\.10\.5M\_R0\.77866/infgain/MinMax\_\_mixup\_\_SMOTE/LogisticRegressionTrue0\.20\.5M\_F10\.65609/biMeanInfgain/noAUG\_\_SMOTE\_\_MinMax/LogisticRegressionFalse0\.20\.5m\_P0\.95347/infgain/mixup\_\_noIMB\_\_MinMax/LogisticRegressionFalse0\.20\.35m\_R0\.95347/infgain/mixup\_\_noIMB\_\_MinMax/LogisticRegressionFalse0\.20\.35m\_F10\.95346/infgain/mixup\_\_TomekLinks\_\_Standard/LogisticRegressionFalse0\.20\.35W\_P0\.95526/infgain/noAUG\_\_noIMB\_\_Standard/XGBmodelFalse0\.10\.5W\_R0\.95347/infgain/mixup\_\_TomekLinks\_\_Standard/LogisticRegressionFalse0\.20\.35W\_F10\.942410/noSelect/Standard\_\_noAUG\_\_noIMB/XGBmodelTrue0\.10\.35Acc0\.95347/infgain/mixup\_\_noIMB\_\_MinMax/LogisticRegressionFalse0\.20\.354M\_P0\.97659/infgain/Standard\_\_noAUG\_\_TomekLinks/LogisticRegressionTrue0\.20\.5M\_R0\.77797/infgain/Standard\_\_mixup\_\_SMOTE/LogisticRegressionTrue0\.20\.5M\_F10\.65609/biMeanInfgain/noAUG\_\_SMOTE\_\_Standard/LogisticRegressionFalse0\.20\.5m\_P0\.95346/infgain/mixup\_\_TomekLinks\_\_Standard/LogisticRegressionFalse0\.20\.35m\_R0\.95346/infgain/mixup\_\_TomekLinks\_\_Standard/LogisticRegressionFalse0\.20\.35m\_F10\.95349/infgain/MinMax\_\_mixup\_\_noIMB/LogisticRegressionTrue0\.20\.35W\_P0\.95526/infgain/noAUG\_\_noIMB\_\_MinMax/XGBmodelFalse0\.10\.5W\_R0\.95349/infgain/MinMax\_\_mixup\_\_noIMB/LogisticRegressionTrue0\.20\.35W\_F10\.942410/noSelect/noAUG\_\_noIMB\_\_Standard/XGBmodelFalse0\.10\.35Acc0\.95346/infgain/mixup\_\_TomekLinks\_\_Standard/LogisticRegressionFalse0\.20\.355M\_P0\.97656/infgain/Standard\_\_noAUG\_\_noIMB/XGBmodelTrue0\.10\.5M\_R0\.77726/infgain/MinMax\_\_mixup\_\_ADASYN/LogisticRegressionTrue0\.20\.5M\_F10\.65119/biMeanInfgain/noAUG\_\_ADASYN\_\_Standard/LogisticRegressionFalse0\.20\.5m\_P0\.95347/infgain/mixup\_\_TomekLinks\_\_Standard/LogisticRegressionFalse0\.20\.35m\_R0\.95347/infgain/mixup\_\_TomekLinks\_\_Standard/LogisticRegressionFalse0\.2
0\.35m\_F10\.95347/infgain/mixup\_\_TomekLinks\_\_Standard/LogisticRegressionFalse0\.20\.35W\_P0\.95526/infgain/MinMax\_\_noAUG\_\_noIMB/XGBmodelTrue0\.10\.5W\_R0\.95346/infgain/mixup\_\_TomekLinks\_\_Standard/LogisticRegressionFalse0\.20\.35W\_F10\.93998/infgain/MinMax\_\_mixup\_\_noIMB/GradientBoostingTrue0\.10\.35Acc0\.95347/infgain/mixup\_\_TomekLinks\_\_Standard/LogisticRegressionFalse0\.20\.35Table Supplementary B\.5:Aggregate performance statistics across all pipeline configurations on the Pima dataset\. Results are reported as mean and standard deviation over all branches\.MetricMeanStdAccuracy0\.6736493830243830\.0897363795392836Macro\_Precision0\.68386013418304060\.0691823082366285Macro\_Recall0\.68268925012530360\.0755243216727676Macro\_F10\.65500641654067460\.0919794462164404Weighted\_Precision0\.72463195260492050\.0672222636465707Weighted\_Recall0\.6736493830243830\.0897363795392836Weighted\_F10\.67067591096932640\.1011450473334413Micro\_Precision0\.6736493830243830\.0897363795392836Micro\_Recall0\.6736493830243830\.0897363795392836Micro\_F10\.6736493830243830\.0897363795392836Integrated\_Score0\.67743023602757970\.0818401652216246Table Supplementary B\.6:Aggregate performance statistics across all pipeline configurations on the Stroke dataset\. Results are reported as mean and standard deviation over all branches\.MetricMeanStdAccuracy0\.8180586742658660\.1450322573703264Macro\_Precision0\.55897566308123490\.0599054344296001Macro\_Recall0\.60287856198632950\.0844300198038256Macro\_F10\.51162984302692830\.048158774525587Weighted\_Precision0\.92555303742881820\.0124443266608974Weighted\_Recall0\.8180586742658660\.1450322573703265Weighted\_F10\.85108098365087190\.0963836572731272Micro\_Precision0\.8180586742658660\.1450322573703264Micro\_Recall0\.8180586742658660\.1450322573703264Micro\_F10\.8180586742658660\.1450322573703264Integrated\_Score0\.74037081995765460\.0726826444494676
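The LogDir entries in Tables B.3 and B.4 appear to encode one pipeline configuration per path: a feature count, a feature selection method, three preprocessing steps joined by double underscores, and a model name. The following is a minimal sketch under that assumption. The function name `parse_logdir`, the field names, and the scaler vocabulary are illustrative and are not taken from the framework itself. The `norm_first` flag is inferred from whether the scaler token comes first, which agrees with the Norm_First column of Table B.4 for the rows listed there.

```python
from dataclasses import dataclass

# Assumed scaler tokens (lowercased), read off the LogDir strings in Tables B.3 and B.4.
SCALERS = {"standard", "minmax"}


@dataclass(frozen=True)
class PipelineConfig:
    n_features: int    # leading number in the path
    selector: str      # e.g. "infgain", "biMaxInfgain", "noSelect"
    steps: tuple       # the three "__"-separated preprocessing tokens, in order
    model: str         # e.g. "XGBmodel", "sklearn_SVM"
    norm_first: bool   # True when the scaler token is the first step


def parse_logdir(logdir: str) -> PipelineConfig:
    """Split '<n>/<selector>/<a>__<b>__<c>/<model>' into its parts."""
    n_features, selector, step_str, model = logdir.strip("/").split("/")
    # Split on the double underscore so tokens such as "gaussian_noise" stay intact.
    steps = tuple(step_str.split("__"))
    norm_first = steps[0].lower() in SCALERS
    return PipelineConfig(int(n_features), selector, steps, model, norm_first)


if __name__ == "__main__":
    print(parse_logdir("6/biMaxInfgain/noAug__TomekLinks__standard/XGBmodel"))
```

Parsing the path into a typed record keeps later grouping (for example, by model or by imbalance handler) independent of string manipulation.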
### Supplementary C\. Pipeline Component Analysis: RF\-Based Importance and Component\-Specific Analysis
This subsection presents an analysis of individual pipeline components and their contributions to predictive performance, using Random Forest\-based importance scores and aggregated metric heatmaps\.
Figure [Supplementary C\.3](https://arxiv.org/html/2605.21528#Ax1.F3) provides component\-wise variability heatmaps, highlighting trends across branches, pipeline variations, and metric types for the Pima dataset\. Figure [Supplementary C\.4](https://arxiv.org/html/2605.21528#Ax1.F4) summarizes component\-level importance for the Pima dataset, showing which preprocessing, augmentation, and normalization steps most strongly influence predictive outcomes\. Figure [Supplementary C\.5](https://arxiv.org/html/2605.21528#Ax1.F5) shows the mean importance values of detailed component\-level items for the Pima dataset\.
Figure [Supplementary C\.6](https://arxiv.org/html/2605.21528#Ax1.F6) provides component\-wise variability heatmaps, highlighting trends across branches, pipeline variations, and metric types for the Stroke dataset\. Figure [Supplementary C\.7](https://arxiv.org/html/2605.21528#Ax1.F7) summarizes component\-level importance for the Stroke dataset, showing which preprocessing, augmentation, and normalization steps most strongly influence predictive outcomes\. Figure [Supplementary C\.8](https://arxiv.org/html/2605.21528#Ax1.F8) shows the mean importance values of detailed component\-level items for the Stroke dataset\.
Together, these visualizations facilitate the identification of the most influential pipeline strategies and allow readers to observe how component choices interact with branch\-specific performance, providing a comprehensive view of pipeline behavior and component contributions across datasets\.
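One way such Random Forest-based component importance could be computed is sketched below. It assumes that each logged pipeline is a row of categorical component choices, that a single performance score is the regression target, and that one-hot importances are summed back to their parent component. The data are synthetic and constructed so that augmentation matters most; the component names, the target, and the aggregation rule are assumptions for illustration, not the authors' confirmed procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
components = ["augmentation", "model", "imbalance"]
choices = [
    ["noAug", "mixup", "gaussian_noise"],
    ["XGB", "SVM", "LR"],
    ["noImbl", "TomekLinks", "SMOTE"],
]

# Synthetic log table: one row per pipeline, one column per component (made-up data).
n_pipelines = 300
X_raw = np.column_stack([rng.choice(c, size=n_pipelines) for c in choices])

# Synthetic score: augmentation has the largest effect, model a small one,
# and imbalance handling none (by construction).
score = (
    0.80
    - 0.10 * (X_raw[:, 0] == "gaussian_noise")
    - 0.03 * (X_raw[:, 1] == "LR")
    + rng.normal(0.0, 0.005, n_pipelines)
)

# One-hot encode the categorical choices and fit a forest on the score.
encoder = OneHotEncoder()
X = encoder.fit_transform(X_raw)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, score)

# Sum the one-hot importances back to the component they came from.
importance, start = {}, 0
for name, categories in zip(components, encoder.categories_):
    stop = start + len(categories)
    importance[name] = float(forest.feature_importances_[start:stop].sum())
    start = stop

if __name__ == "__main__":
    for name, value in sorted(importance.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {value:.3f}")
```

Because impurity-based importances sum to one, the aggregated values can be read as shares of explained score variation, which is the form in which the component importances are reported in this paper.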
Figure Supplementary C\.3: Pima dataset: unified metric mega heat map for branches\.
Figure Supplementary C\.4: Random Forest\-based component importance for the Pima dataset\.
Figure Supplementary C\.5: Detailed Random Forest importance analysis for the Pima dataset\.
Figure Supplementary C\.6: Stroke dataset: unified metric mega heat map for branches\.
Figure Supplementary C\.7: Random Forest\-based component importance for the Stroke dataset\.
Figure Supplementary C\.8: Detailed Random Forest importance analysis for the Stroke dataset\.
### Supplementary D\. Value\-Level Component Similarity Analysis
This subsection evaluates value\-level component similarity within the Pima dataset using Root Mean Square \(RMS\) distance, where a lower distance indicates more similar behavior, identifying configurations that yield functionally similar predictive behavior\.
Table [Supplementary D\.7](https://arxiv.org/html/2605.21528#Ax1.T7) reports pairwise similarities across three component categories for the Pima dataset: feature and feature selection components, data transformation components \(e\.g\., preprocessing, augmentation\), and model/decision components \(e\.g\., classifiers, thresholds\)\.
These results highlight redundant or interchangeable component values, enabling pipeline simplification through pruning of low\-impact configurations and supporting a more consistent, robust pipeline configuration\.
Overall, this analysis provides a fine\-grained view of intra\-component similarity, component equivalence, and interaction patterns specific to the Pima dataset\.
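Table Supplementary D\.7: Value\-Level Similarity \(RMS\) across all pipeline components on the Pima dataset\.

\(a\) Feature and Feature Selection Components

| | F:4 | F:5 | F:6 | F:7 | F:8 | FS:biMax | FS:biMean | FS:inf | FS:none |
|---|---|---|---|---|---|---|---|---|---|
| F:4 | | 0.0381 | 0.0397 | 0.0414 | 0.0405 | | | | |
| F:5 | 0.0381 | | 0.0342 | 0.0364 | 0.0388 | | | | |
| F:6 | 0.0397 | 0.0342 | | 0.0345 | 0.0373 | | | | |
| F:7 | 0.0414 | 0.0364 | 0.0345 | | 0.0355 | | | | |
| F:8 | 0.0405 | 0.0388 | 0.0373 | 0.0355 | | | | | |
| FS:biMax | | | | | | | 0.0252 | 0.0394 | 0.0405 |
| FS:biMean | | | | | | 0.0252 | | 0.0392 | 0.0411 |
| FS:inf | | | | | | 0.0394 | 0.0392 | | 0.0458 |
| FS:none | | | | | | 0.0405 | 0.0411 | 0.0458 | |

\(b\) Data Transformation Components

| | Sc:minmax | Sc:std | Aug:noise | Aug:mix | Aug:none | Imb:ADA | Imb:RUS | Imb:SMOTE | Imb:Tomek | Imb:none |
|---|---|---|---|---|---|---|---|---|---|---|
| Sc:minmax | | 0.0438 | | | | | | | | |
| Sc:std | 0.0438 | | | | | | | | | |
| Aug:noise | | | | 0.1045 | 0.1028 | | | | | |
| Aug:mix | | | 0.1045 | | 0.0279 | | | | | |
| Aug:none | | | 0.1028 | 0.0279 | | | | | | |
| Imb:ADA | | | | | | | 0.0435 | 0.0383 | 0.0648 | 0.0672 |
| Imb:RUS | | | | | | 0.0435 | | 0.0395 | 0.0567 | 0.0595 |
| Imb:SMOTE | | | | | | 0.0383 | 0.0395 | | 0.0548 | 0.0571 |
| Imb:Tomek | | | | | | 0.0648 | 0.0567 | 0.0548 | | 0.0325 |
| Imb:none | | | | | | 0.0672 | 0.0595 | 0.0571 | 0.0325 | |

\(c\) Model and Decision Components

| | M:DT | M:GB | M:LR | M:XGB | M:RF | M:SVM | Norm:first | Norm:last | Thr:.35 | Thr:.5 | Split:.1 | Split:.2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M:DT | | 0.0799 | 0.0970 | 0.0792 | 0.0736 | 0.0992 | | | | | | |
| M:GB | 0.0799 | | 0.0510 | 0.0319 | 0.0422 | 0.0510 | | | | | | |
| M:LR | 0.0970 | 0.0510 | | 0.0523 | 0.0581 | 0.0384 | | | | | | |
| M:XGB | 0.0792 | 0.0319 | 0.0523 | | 0.0419 | 0.0514 | | | | | | |
| M:RF | 0.0736 | 0.0422 | 0.0581 | 0.0419 | | 0.0596 | | | | | | |
| M:SVM | 0.0992 | 0.0510 | 0.0384 | 0.0514 | 0.0596 | | | | | | | |
| Norm:first | | | | | | | | 0.0349 | | | | |
| Norm:last | | | | | | | 0.0349 | | | | | |
| Thr:.35 | | | | | | | | | | 0.0566 | | |
| Thr:.5 | | | | | | | | | 0.0566 | | | |
| Split:.1 | | | | | | | | | | | | 0.0472 |
| Split:.2 | | | | | | | | | | | | 0.0472 |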
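The RMS distances in Table D.7 compare two values of the same component. The exact pairing rule is not restated in this section, so the sketch below assumes one plausible reading: pipelines are matched on every other component, and the distance is the root mean square of the paired score differences, d(a, b) = sqrt(mean((s_a - s_b)^2)). The function name `rms_distance`, the dictionary layout, and all scores are hypothetical.

```python
import math


def rms_distance(scores_a, scores_b):
    """RMS of paired score differences between two values of one component.

    Each argument maps "the rest of the configuration" to a score; only
    configurations present in both mappings are compared.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    if not shared:
        raise ValueError("no matched configurations to compare")
    squared = [(scores_a[key] - scores_b[key]) ** 2 for key in shared]
    return math.sqrt(sum(squared) / len(squared))


# Made-up scores for two feature selection values, keyed by (model, augmentation).
bimax = {("XGB", "noAug"): 0.88, ("SVM", "mixup"): 0.86, ("LR", "noAug"): 0.80}
bimean = {("XGB", "noAug"): 0.87, ("SVM", "mixup"): 0.88, ("LR", "noAug"): 0.80}

if __name__ == "__main__":
    print(f"RMS distance: {rms_distance(bimax, bimean):.4f}")
```

Under this reading, a small distance means that swapping one value for the other barely changes the score of an otherwise identical pipeline, which is the sense in which two values are called redundant.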
### Supplementary E\. Cross\-Component Interaction Analysis
This subsection examines interactions between pipeline components by analyzing correlation patterns across branches and components in the evaluated pipelines\. Given the large number of possible components, we focus on selected subsets to highlight the most informative interactions\. Specifically, we analyze:
- Top\-10 components by mean pairwise correlation, representing consistently interacting components\.
- Top\-10 components with the highest standard deviation of correlation, representing components with highly variable interactions across configurations\.
- Clustered groups of components derived from the full correlation matrices, indicating sets of components that tend to co\-vary functionally\.
For the Pima dataset, Figure [Supplementary E\.9](https://arxiv.org/html/2605.21528#Ax1.F9) shows general correlation patterns among key components, while Figure [Supplementary E\.10](https://arxiv.org/html/2605.21528#Ax1.F10) highlights the most variable interactions \(highest standard deviation\)\. Figure [Supplementary E\.11](https://arxiv.org/html/2605.21528#Ax1.F11) shows the clustered part\-to\-part correlations for the top\-10 high\-variance components in the Pima dataset\.
For the Stroke dataset, Figure [Supplementary E\.12](https://arxiv.org/html/2605.21528#Ax1.F12) shows general correlation patterns among key components, while Figure [Supplementary E\.13](https://arxiv.org/html/2605.21528#Ax1.F13) displays the top\-10 highest\-standard\-deviation correlations\. Figure [Supplementary E\.14](https://arxiv.org/html/2605.21528#Ax1.F14) shows the clustered part\-to\-part correlations for the top\-10 high\-variance components in the Stroke dataset\.
These analyses reveal groups of preprocessing, augmentation, normalization, or modeling steps that tend to co\-vary, providing insights into synergistic or redundant pipeline configurations\. By focusing on the top or highly variable components, we reduce complexity while highlighting the most influential interactions\. This allows readers to understand how combinations of components collectively impact predictive performance, guiding future pipeline optimization strategies\.
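The two ranking criteria used in this subsection (mean pairwise correlation and standard deviation of correlation) can be illustrated with a small sketch. It assumes that each component value is described by a vector of scores across branches and that Pearson correlation is the measure used. Both are assumptions, as are the function names and the made-up profiles. The clustering step is omitted.

```python
import math
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_components(profiles, k=10):
    """Return the top-k names by mean correlation and by std of correlation."""
    names = list(profiles)
    # Correlation of each component with every other component.
    corr = {
        a: [pearson(profiles[a], profiles[b]) for b in names if b != a]
        for a in names
    }
    by_mean = sorted(names, key=lambda n: mean(corr[n]), reverse=True)[:k]
    by_std = sorted(names, key=lambda n: pstdev(corr[n]), reverse=True)[:k]
    return by_mean, by_std


# Hypothetical per-branch score profiles (made-up numbers).
profiles = {
    "Aug:mix": [0.80, 0.82, 0.85, 0.83],
    "Aug:none": [0.81, 0.83, 0.86, 0.84],
    "Imb:SMOTE": [0.70, 0.75, 0.72, 0.78],
    "M:DT": [0.66, 0.60, 0.58, 0.61],
}

if __name__ == "__main__":
    top_mean, top_std = rank_components(profiles, k=3)
    print("highest mean correlation:", top_mean)
    print("most variable correlation:", top_std)
```

Ranking by the mean favors components that move together with most others, while ranking by the standard deviation surfaces components whose relationship to the rest depends strongly on the partner, which is why the two top-10 lists can differ.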
Figure Supplementary E\.9: Pima dataset: Top\-10 part\-to\-part correlation heat map\.
Figure Supplementary E\.10: Pima dataset: Top\-10 highest\-standard\-deviation part\-to\-part correlation heat map\.
Figure Supplementary E\.11: Pima dataset: clustered part\-to\-part correlation heat map for the top\-10 high\-variance components\.
Figure Supplementary E\.12: Stroke dataset: Top\-10 part\-to\-part correlation heat map\.
Figure Supplementary E\.13: Stroke dataset: Top\-10 highest\-standard\-deviation part\-to\-part correlation heat map\.
Figure Supplementary E\.14: Stroke dataset: clustered part\-to\-part correlation heat map\.
### Supplementary F\. Cross\-Seed Robustness Analysis
This subsection evaluates cross\-seed variability and model robustness across multiple random initializations\.
Table [Supplementary F\.8](https://arxiv.org/html/2605.21528#Ax1.T8) reports Friedman test results across evaluation metrics computed over random seeds\.
Figure [Supplementary F\.15](https://arxiv.org/html/2605.21528#Ax1.F15) presents a Critical Difference \(CD\) diagram based on Macro F1\-score, highlighting statistically significant differences between models under a class\-imbalanced evaluation setting\.
Figure [Supplementary F\.16](https://arxiv.org/html/2605.21528#Ax1.F16) shows cross\-seed mean performance and standard deviation across random seeds\.
Overall, these results provide a concise view of stability, robustness, and reproducibility, supporting reliable pipeline selection\.
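As a rough illustration of the statistics behind a Friedman test and a CD diagram, the sketch below ranks models within each seed, computes the Friedman chi-square without tie correction, and derives a critical difference as CD = q_alpha * sqrt(k(k+1)/(6N)) for k models and N seeds. Using the Nemenyi critical values for the CD is an assumption, since this section does not name the post-hoc procedure, and the scores are made up. In practice `scipy.stats.friedmanchisquare` returns the statistic together with a p-value.

```python
import math

# Two-tailed Nemenyi critical values q_0.05 by number of models k (Demsar, 2006).
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
              7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def average_ranks(row):
    """Rank the scores of one seed (rank 1 = best), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda i: -row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for t in range(i, j + 1):
            ranks[order[t]] = shared_rank
        i = j + 1
    return ranks


def friedman_cd(scores):
    """scores[seed][model] -> (chi-square, mean rank per model, critical difference)."""
    n, k = len(scores), len(scores[0])
    ranked = [average_ranks(row) for row in scores]
    mean_ranks = [sum(r[j] for r in ranked) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    cd = Q_ALPHA_05[k] * math.sqrt(k * (k + 1) / (6 * n))
    return chi2, mean_ranks, cd


# Made-up Macro F1 scores: 4 seeds (rows) by 3 models (columns).
scores = [
    [0.88, 0.85, 0.80],
    [0.87, 0.86, 0.79],
    [0.89, 0.84, 0.81],
    [0.86, 0.85, 0.78],
]

if __name__ == "__main__":
    chi2, mean_ranks, cd = friedman_cd(scores)
    print(f"chi2={chi2:.2f}, mean ranks={mean_ranks}, CD={cd:.3f}")
```

Two models whose mean ranks differ by less than the CD would be joined by a bar in the diagram. Because the CD depends only on k, N, and the significance level, it is identical for every metric, which is consistent with the constant CD column in Table F.8.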
Table Supplementary F\.8: Friedman test results across evaluation metrics \(computed over random seeds\)\.

| Metric | $\chi^{2}$ | $p$-value | CD |
|---|---|---|---|
| Accuracy | 27.81 | $3.97\times 10^{-5}$ | 2.53 |
| Macro Precision | 27.90 | $3.80\times 10^{-5}$ | 2.53 |
| Macro Recall | 27.05 | $5.58\times 10^{-5}$ | 2.53 |
| Macro F1 | 27.81 | $3.97\times 10^{-5}$ | 2.53 |
| Weighted Precision | 27.05 | $5.58\times 10^{-5}$ | 2.53 |
| Weighted Recall | 27.81 | $3.97\times 10^{-5}$ | 2.53 |
| Weighted F1 | 27.81 | $3.97\times 10^{-5}$ | 2.53 |
| Micro Precision | 27.81 | $3.97\times 10^{-5}$ | 2.53 |
| Micro Recall | 27.81 | $3.97\times 10^{-5}$ | 2.53 |
| Micro F1 | 27.81 | $3.97\times 10^{-5}$ | 2.53 |
| Integrated Score | 26.38 | $7.53\times 10^{-5}$ | 2.53 |

Figure Supplementary F\.15: Critical Difference \(CD\) diagram for model ranks across random seeds based on Macro F1\-score\. Models connected by a horizontal bar are not significantly different according to the Friedman test and CD threshold\.
Figure Supplementary F\.16: Cross\-seed mean performance and standard deviation across random seeds\.