Customer Churn Prediction on Structured Data Using FT-Transformer and Stacking Ensembles
Summary
This paper presents a hybrid architecture combining FT-Transformer with gradient-boosted trees via calibration-aware stacking for customer churn prediction on structured tabular data, achieving improved F1 and AUC-ROC on a public bank churn dataset.
# Customer Churn Prediction on Structured Data Using FT-Transformer and Stacking Ensembles
Source: [https://arxiv.org/html/2606.07582](https://arxiv.org/html/2606.07582)
Corresponding author: Joyjit Roy \(e\-mail: joyjit\.roy\.tech@gmail\.com\)\.
SAMARESH KUMAR SINGH2 and LAXMI SHAW3

Independent Researcher, Austin, TX, USA \(e\-mail: joyjit\.roy\.tech@gmail\.com\); Independent Researcher, Leander, TX \(e\-mail: ssam3003@gmail\.com\); Texas A&M University\-Victoria, Victoria, TX \(e\-mail: laxmishaw1983@gmail\.com\)
###### Abstract
Customer churn prediction is essential across data\-driven industries such as insurance, digital banking, e\-commerce, and subscription platforms, where retaining existing customers is typically more cost\-effective than acquiring new ones\. Predicting churn on structured tabular datasets remains challenging due to class imbalance, nonlinear feature interactions, and heterogeneous feature types\. Tree\-based ensemble methods consistently demonstrate strong performance in these contexts, often outperforming conventional neural networks\. This study introduces a validated hybrid architecture that integrates feature\-tokenized transformers \(FT\-Transformer\) with gradient\-boosted trees through calibration\-aware stacking\. The proposed framework addresses persistent gaps in statistical validation, probability calibration, and reproducibility found in prior research\. The FT\-Transformer captures higher\-order feature interactions using self\-attention, while XGBoost captures gradient\-boosted decision boundaries with complementary inductive biases\. Class imbalance is handled through class\-weighted loss functions, avoiding synthetic oversampling and preserving minority class distributions\. The models are ensembled using out\-of\-fold \(OOF\) stacking with a logistic regression meta\-learner, which recalibrates overconfident base model outputs and learns optimal combination weights\. On a public bank churn dataset \(10,000 customers, 20% churn rate\), the hybrid model achieves 62\.10% F1, 0\.861 AUC\-ROC, and 0\.647 PR\-AUC, outperforming the Multi\-Layer Perceptron \(MLP\) baseline by 3\.37 F1 points \(p<0\.001\) and 0\.027 AUC under 5×5 cross\-validation with 95% confidence intervals reported\. Ablation studies demonstrate that both the transformer component and stacking strategy contribute materially to performance\.
The proposed methodology offers a reproducible and extensible reference architecture for contemporary churn prediction on structured tabular data, bridging recent advances in attention\-based modeling with ensemble techniques\.
###### Index Terms:
Customer churn prediction, FT\-Transformer, gradient boosting, stacking ensemble, tabular data, class imbalance, probability calibration, reproducible machine learning
## I Introduction
### I\-A Background and Business Motivation
Customer churn refers to customers discontinuing their engagement or account with a business, typically by cancelling a service or closing an account\. It remains one of the most severe financial issues for organizations across banking, insurance, eCommerce, telecommunications, and subscription\-based services\[reichheld1990\]\. Studies indicate that a company may need to spend 5 to 25 times as much to acquire a new customer as to retain an existing one\[neslin2006\]\. Customer turnover rates in sectors such as telecommunications and subscription\-based services are reported to range between 20–40% annually\[burez2009\]\.
Even a small change in the churn rate has a large financial effect\. For example, a hypothetical medium\-sized bank with 500,000 customers and an average customer lifetime value \(CLV\) of $2,000 would preserve approximately $10 million per year if it retained an additional 1% of its customer base \(5,000 customers\)\. A fictitious eCommerce platform with 2 million active subscribers and a $50 average CLV would retain roughly $1 million under the same conditions\. A regional insurance agency with 100,000 policyholders and an $800 average CLV would preserve around $800,000 annually\. Accurate churn prediction enables companies to apply targeted retention techniques, allocate marketing budgets more effectively, and manage customer lifetime value across operational sectors\.
This paper addresses the challenge of improving churn prediction accuracy through a hybrid modeling approach that integrates transformer\-based feature learning with gradient\-boosted decision trees, evaluated using rigorous statistical validation\.
### I\-B Technical Challenges
Addressing this problem effectively requires navigating several technical challenges that distinguish churn prediction from standard classification tasks\. Key challenges include:
1. Class Imbalance: Churners typically represent only 15–25% of the customer base\. Models trained without imbalance\-aware methods tend to favor the majority class, making accuracy metrics misleading\. For example, a classifier predicting “no churn” for every customer may show 80% accuracy yet deliver no actionable business value\.
2. Complex Feature Interactions: Behavioral, demographic, transactional, and engagement variables interact in nonlinear ways\. For instance, the churn risk associated with a low account balance may depend on engagement level, creating interaction effects that linear or additive models fail to capture\.
3. Heterogeneous Features: Structured tabular datasets combine numerical and categorical variables\. They lack the spatial or sequential organization present in images or text\. As a result, standard deep learning architectures, whose inductive biases were designed for spatially or sequentially organized inputs, often fail to generalize on tabular data\.
4. Limited Dataset Sizes: Churn datasets typically range from a few thousand to several hundred thousand records, substantially smaller than those used in computer vision or NLP\. This scale increases the risk of overfitting and limits the applicability of data\-hungry deep learning models\.
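The accuracy paradox in the first item can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical population of 1,000 customers with a 20% churn rate and a trivial classifier that always predicts "no churn"; the counts and the helper name `confusion_counts` are illustrative and not taken from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = churn."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical population: 800 retained customers, 200 churners (20% churn).
y_true = [0] * 800 + [1] * 200
# Trivial classifier: predict "no churn" for everyone.
y_pred = [0] * len(y_true)

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.80 recall=0.00 f1=0.00
```

The classifier reaches 80% accuracy without identifying a single churner, which is why the paper reports F1, AUC\-ROC, and PR\-AUC rather than accuracy alone.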
### I\-C Evolution of Churn Prediction Approaches
Traditional churn prediction methods have relied on logistic regression and tree\-based ensembles\. XGBoost\[chen2016xgboost\] emerged as a leading approach due to its ability to model nonlinear interactions\. More recently, transformer architectures, such as FT\-Transformer\[gorishniy2021\], have introduced self\-attention mechanisms to capture feature dependencies in tabular data\. Despite these advancements, tree\-based models remain competitive in tabular benchmarks\[grinsztajn2022\]\. To the best of our knowledge, prior work has not systematically combined these two model families with rigorous statistical validation and probability calibration analysis, leaving a methodological gap in the churn prediction literature\.
### I\-D Research Gap
Progress in both tree\-based ensembles and Transformer\-based structured models has been substantial, yet several methodological gaps remain:
1. Limited Hybrid Approaches: Very few studies explore hybrid architectures that combine tree ensembles with Transformer\-based models, despite their inductive biases being complementary\.
2. Insufficient Ablation Studies: Many works report final performance without isolating the contributions of individual architectural components\.
3. Class Imbalance Treatment: Several techniques rely heavily on oversampling methods, which can introduce synthetic artifacts and distort minority class distributions\.
4. Reproducibility Concerns: Precise algorithmic descriptions, preprocessing details, and exact hyperparameter settings are often omitted, which limits reproducibility\.
Prior research has neither systematically integrated FT\-Transformer with tree\-based ensemble methods nor applied rigorous statistical validation or probability calibration analysis\. Existing studies assess these architectures independently\[huang2020,gorishniy2021,somepalli2021\] or limit stacking approaches to classical models\[xu2021\]\. Recent studies from 2023 to 2025 that combine models primarily utilize SMOTE\-based oversampling\[ahmad2023,usmanhamza2024\], omit calibration analysis, and do not report confidence intervals or effect sizes\. The current study addresses all three of these shortcomings\.
### I\-E Contributions
This paper makes the following contributions:
1. Hybrid architecture for tabular churn prediction: Integrates calibration\-aware stacking with FT\-Transformer and XGBoost\. To the best of the authors’ knowledge, this is among the first studies to combine these components specifically for churn prediction, with systematic validation of error independence \($\rho=0.62$\) and probability calibration \(ECE = 0\.038\)\. The proposed approach achieves an F1 score of 62\.10% and demonstrates statistically significant improvements over all baselines \(p<0\.001\)\.
2. Comprehensive ablation analysis: Isolates the contributions of transformer layers, ensemble strategies, and meta\-learner selection under controlled experimental conditions\.
3. Probability calibration assessment: Through Expected Calibration Error analysis, demonstrates improved decision reliability for cost\-sensitive interventions\.
4. Fully reproducible implementation: Provides detailed algorithmic specifications, preprocessing steps, and hyperparameter configurations to facilitate adoption and validation\.
The structure of this paper is as follows\. Section [II](https://arxiv.org/html/2606.07582#S2) reviews related work across classical, tree\-based, deep learning, and ensemble methods\. Section [III](https://arxiv.org/html/2606.07582#S3) introduces the mathematical formulation of the proposed framework\. Section [IV](https://arxiv.org/html/2606.07582#S4) describes the dataset\. Section [V](https://arxiv.org/html/2606.07582#S5) outlines the methodology, including preprocessing, model training, and stacking procedures\. Section [VI](https://arxiv.org/html/2606.07582#S6) presents experimental results, such as baseline comparisons, feature importance, and calibration analysis\. Section [VII](https://arxiv.org/html/2606.07582#S7) covers ablation and sensitivity studies\. Section [VIII](https://arxiv.org/html/2606.07582#S8) discusses the findings and their business implications\. Section [IX](https://arxiv.org/html/2606.07582#S9) examines limitations and threats to validity\. Section [X](https://arxiv.org/html/2606.07582#S10) concludes the paper, and Section [XI](https://arxiv.org/html/2606.07582#S11) outlines future research directions\.
## II Related Work
TABLE I: Overview of Prior Research on Churn Prediction Approaches

Related studies fall into four main categories: classical statistical methods, tree\-based ensemble methods, deep learning approaches for tabular data, and ensemble or stacking strategies\.
### II\-A Classical Statistical Methods
Logistic regression remained the dominant method in early churn studies\[neslin2006\]\. It was valued for interpretability and probabilistic outputs\. However, the linear log\-odds assumption limits its ability to capture nonlinear relationships and complex feature interactions common in modern customer datasets\.
Verbeke et al\. evaluated decision trees and rule\-based classifiers for churn prediction\[verbeke2011\]\. A follow\-up study\[verbeke2012new\] extended this analysis to telecom churn and demonstrated that ensemble methods consistently outperform single classifiers across domain\-specific settings\. Rule\-based models remain interpretable while performing competitively\. Single trees still show high variance, which motivates the development of ensemble methods\. Survival analysis methods\[baesens2014\] model time\-to\-churn rather than binary outcomes but require longitudinal data with precise churn timing, which is often unavailable\. These classical approaches provide interpretability but struggle with the high\-dimensional feature spaces and severe class imbalance typical of modern churn datasets\. Ensemble methods address these challenges more effectively\.
### II\-B Tree\-Based Ensemble Methods
Ensemble learning combines multiple weak learners to reduce variance and improve generalization\[dietterich2000\], marking a key shift in churn prediction methodology\.
Bagging and Random Forests: Breiman’s Random Forest method\[breiman2001\] builds an ensemble of decision trees using resampled training data and randomly selected feature subsets\. This diversification reduces tree correlation and variance while enabling feature importance estimates\. The final predictions are made by majority vote or averaging\.
Boosting Methods: Boosting trains models in sequence, with each stage emphasizing the samples or regions where earlier learners performed poorly\. AdaBoost\[freund1997\] reweights misclassified samples, while Gradient Boosting\[friedman2001\] fits new trees to loss gradients, offering lower bias than bagging\.
XGBoost: XGBoost\[chen2016xgboost\] remains a dominant boosting framework due to regularization, efficient handling of missing values, and parallel tree construction\. Xu et al\.\[xu2021\] report 98% accuracy on a telecom churn dataset using XGBoost with feature grouping and stacking\. The study notes that precision and recall are more informative than accuracy under class imbalance\.
Modern Gradient Boosting Variants: Several optimized implementations have emerged alongside XGBoost\. LightGBM\[ke2017\] introduces leaf\-wise tree growth and histogram\-based splitting for faster training on large datasets\. CatBoost\[prokhorenkova2018\] provides specialized encoding for high\-cardinality categorical features and ordered boosting to reduce overfitting\. These variants offer performance advantages for specific data characteristics\. However, XGBoost remains widely adopted due to its maturity and extensive validation across multiple domains\.
### II\-C Deep Learning for Tabular Data
The strong performance of deep learning in areas like computer vision and NLP has encouraged researchers to explore neural architectures for tabular datasets\. This transition has proven difficult in practice\. Grinsztajn et al\.\[grinsztajn2022\] compared deep learning and tree\-based methods across 45 tabular datasets and identified several factors behind the consistent strength of tree\-based models:
- Lack of Inductive Bias: CNNs exploit spatial locality and transformers capture sequence structure, but tabular data provides no inherent structure for neural networks\.
- Irregular Target Functions: Tabular targets exhibit sharp decision boundaries that trees capture through axis\-aligned splits, while neural networks prefer smooth functions\.
- Feature Characteristics: Trees ignore uninformative features through selection, remain invariant to monotonic transformations, and avoid rotation sensitivity seen in MLPs\.
Recent transformer\-based architectures address some of these challenges\[algul2025\]\. TabTransformer\[huang2020\] applies self\-attention to categorical embeddings but does not explicitly model interactions with numerical features\. FT\-Transformer\[gorishniy2021\] tokenizes numerical and categorical features into a shared representation and applies transformer layers to model feature interactions\. SAINT \(Self\-Attention and Intersample Attention Transformer\)\[somepalli2021\] introduces row\-wise attention and contrastive pre\-training at higher computational cost\. The computational complexity of attention\-based architectures has been analyzed in broad learning contexts\[jin2024flexible,jin2022regularized\], where regularization and manifold\-based methods offer efficiency trade\-offs relevant to tabular model design\.
TabNet \(Tabular Attentive Network\)\[arik2021\] uses sequential attention for interpretable feature selection\. Recent work by Sarafian\[sarafian2025\] explores improved training procedures and architectural refinements for deep tabular learning\.
### II\-D Ensemble and Stacking Strategies
Stacking \(stacked generalization\), introduced by Wolpert\[wolpert1992\], trains a meta\-model on the outputs of several base learners to integrate their predictive signals\. Unlike simple averaging or voting, stacking learns combination weights and can model nonlinear relationships among base models\.
Theoretical Foundation: Zhou\[zhou2012\] explains that ensemble gains result from combining base learners that are both accurate and diverse in their error patterns\. Under the assumption of independent errors, ensemble error decreases exponentially with the number of base learners, but this condition rarely holds in practice\. Diversity can instead be achieved by combining models with different inductive biases, such as trees and neural networks\.
Out\-of\-Fold Predictions: To limit overfitting, meta\-learners are trained on out\-of\-fold \(OOF\) outputs generated during cross\-validation\[wolpert1992,zhou2012\]\. K\-fold cross\-validation ensures that each sample is evaluated by a model that has not encountered it during training, yielding unbiased inputs for the meta\-learner\.
Meta\-Learner Selection: Prior studies\[zhou2012,hastie2009\] employ logistic regression, Gradient Boosting Machine \(GBM\), or neural networks as meta\-learners, selected according to the complexity of the base models\. Rokach\[rokach2010\] recommends simpler meta\-learners when the number of base models is small\. Recent advances have explored transformer\-based stacking: Yang et al\.\[yang2025\] introduce a stacking strategy that incorporates transformers for multi\-model integration\.
### II\-E Recent Advances \(2022–2025\)
Recent research has advanced transformer architectures for tabular data beyond FT\-Transformer\. TabPFN \(Tabular Prior\-data Fitted Networks\)\[hollmann2023\] applies meta\-learning for few\-shot tabular classification, and follow\-up work\[hollmann2025\] shows that tabular foundation models can perform well even with small datasets\. ExcelFormer\[chen2023\] introduces attention mechanisms for spreadsheet\-style inputs\. AutoML systems such as AutoGluon\[erickson2020\] and H2O AutoML\[ledell2020\] automate model and hyperparameter selection and often match or exceed manual tuning\.
Recent ensemble studies have addressed class imbalance, hybrid architectures, and domain\-specific applications in churn prediction\. Usman\-Hamza et al\.\[usmanhamza2024\] propose a heterogeneous multi\-layer stacking ensemble that combines SMOTE \(Synthetic Minority Oversampling Technique\) with diverse base learners for telecom churn prediction\. This improves the detection of minority classes\. However, the reliance on SMOTE introduces synthetic artifacts, and the study omits probability calibration analysis\.
Ahmad et al\.\[ahmad2023\] implemented hybrid approaches integrating Random Forest, XGBoost, and LightGBM with SMOTE\-based class balancing\. However, statistical significance testing and calibration metrics are not reported\.
Warnakulaarachchi and Kumarapathirage\[warnakulaarachchi2025\] employ a deep ensemble method for banking churn, which illustrates domain\-specific adaptations to improve financial customer retention\. The study does not incorporate transformer\-based feature learning or conduct ablation analysis\.
FT\-Transformer was selected for this work because it performs consistently on benchmark datasets\[gorishniy2021\], has a simple architecture suitable for ablation, and offers interpretable attention weights\. Direct benchmarking against TabTransformer\[huang2020\], SAINT\[somepalli2021\], and AutoGluon\[erickson2020\] is beyond the scope of this study and is identified as a priority direction in Section [XI](https://arxiv.org/html/2606.07582#S11)\.
### II\-F Summary and Research Gap
Table [I](https://arxiv.org/html/2606.07582#S2.T1) summarizes the major related works\. Based on the reviewed literature, prior work has addressed individual components such as tree ensembles, transformer architectures, and stacking strategies, but has not combined them systematically with calibration\-aware evaluation and rigorous ablation\. This work addresses that gap directly\.
## III Mathematical Formulation
### III\-A Problem Definition
Let $\mathcal{D}=\{(\mathbf{x}^{(i)},y^{(i)})\}_{i=1}^{N}$ be a dataset of $N$ customer records, where $\mathcal{X}$ denotes the input feature space\. Each feature vector $\mathbf{x}^{(i)}=[x_{1}^{(i)},x_{2}^{(i)},\ldots,x_{m}^{(i)}]\in\mathcal{X}$ contains $m$ features partitioned into numerical features $\mathbf{x}_{num}\in\mathbb{R}^{m_{num}}$ and categorical features $\mathbf{x}_{cat}\in\mathcal{C}_{1}\times\cdots\times\mathcal{C}_{m_{cat}}$, where $\mathcal{C}_{j}$ is the set of categories for feature $j$\. The binary label $y^{(i)}\in\{0,1\}$ indicates churn \($y=1$\) or retention \($y=0$\)\.
The goal is to learn a function $f:\mathcal{X}\rightarrow[0,1]$ that outputs the probability $\hat{p}^{(i)}=f(\mathbf{x}^{(i)})=P(Y=1\mid\mathbf{X}=\mathbf{x}^{(i)})$ that customer $i$ will churn\. The final prediction is $\hat{y}^{(i)}=\mathbb{1}[\hat{p}^{(i)}>\tau]$, where $\mathbb{1}[\cdot]$ is the indicator function that returns 1 if the condition holds and 0 otherwise, and $\tau$ is the classification threshold \(typically 0\.5\)\.
### III\-B Class\-Weighted Binary Cross\-Entropy Loss
Standard binary cross\-entropy \(BCE\) loss treats all samples equally:
$$\mathcal{L}_{BCE}(\hat{p},y)=-[y\log(\hat{p})+(1-y)\log(1-\hat{p})] \tag{1}$$
Under class imbalance \(e\.g\., 80% non\-churners, 20% churners\), this loss is dominated by the majority class\. Class weights are applied to address imbalance:
$$\mathcal{L}_{weighted}(\hat{p},y)=-[w_{+}\cdot y\cdot\log(\hat{p})+w_{-}\cdot(1-y)\cdot\log(1-\hat{p})] \tag{2}$$
where $w_{+}$ and $w_{-}$ are weights for positive \(churn\) and negative \(non\-churn\) classes\. Weights are assigned inversely proportional to class frequencies:
$$w_{+}=\frac{N}{2\cdot N_{+}},\quad w_{-}=\frac{N}{2\cdot N_{-}} \tag{3}$$
where $N_{+}$ and $N_{-}$ are counts of positive and negative samples\. For the dataset used here with $N_{+}=2{,}037$ and $N_{-}=7{,}963$, this yields $w_{+}\approx 2.45$ and $w_{-}\approx 0.63$, effectively upweighting churner samples by approximately $4\times$ relative to non\-churners\.
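The weights quoted above follow directly from Eq. (3). The short check below reproduces them from the class counts given in the text; the function name `balanced_class_weights` is a hypothetical helper introduced here for illustration.

```python
def balanced_class_weights(n_pos, n_neg):
    """Inverse-frequency weights from Eq. (3): w = N / (2 * N_class)."""
    n_total = n_pos + n_neg
    return n_total / (2 * n_pos), n_total / (2 * n_neg)

# Class counts reported for the bank churn dataset.
w_pos, w_neg = balanced_class_weights(n_pos=2037, n_neg=7963)
print(round(w_pos, 2), round(w_neg, 2), round(w_pos / w_neg, 2))
# 2.45 0.63 3.91
```

The ratio of the two weights equals $N_{-}/N_{+}$, so the "approximately 4×" upweighting is the class imbalance ratio itself.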
The total loss over a mini\-batch $\mathcal{B}$ is:
$$\mathcal{L}_{total}=\frac{1}{|\mathcal{B}|}\sum_{(\mathbf{x},y)\in\mathcal{B}}\mathcal{L}_{weighted}(f(\mathbf{x}),y) \tag{4}$$
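A minimal sketch of Eqs. (2) and (4) in plain Python may help make the effect of the weights concrete. The clipping constant `eps` is an assumption added here to avoid `log(0)`; it is not part of the paper's formulation, and the example probabilities are illustrative.

```python
import math

def weighted_bce(p_hat, y, w_pos, w_neg, eps=1e-12):
    """Per-sample class-weighted BCE, Eq. (2); eps guards against log(0)."""
    p = min(max(p_hat, eps), 1.0 - eps)
    return -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p))

def batch_loss(probs, labels, w_pos, w_neg):
    """Mean weighted BCE over a mini-batch, Eq. (4)."""
    losses = [weighted_bce(p, y, w_pos, w_neg) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

# Weights from Eq. (3) for N+ = 2,037 and N- = 7,963.
w_pos, w_neg = 10000 / (2 * 2037), 10000 / (2 * 7963)

# The same confidence error (probability 0.3 on the true class) costs more on a churner.
loss_missed_churner = weighted_bce(0.3, 1, w_pos, w_neg)
loss_missed_retained = weighted_bce(0.7, 0, w_pos, w_neg)
ratio = loss_missed_churner / loss_missed_retained
print(round(ratio, 2))
```

With equal weights the two errors would be penalized identically; with the class weights, the churner error is penalized by the factor $w_{+}/w_{-}$.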
### III\-C FT\-Transformer Architecture
The FT\-Transformer\[gorishniy2021\] transforms heterogeneous tabular features into a sequence of embeddings and applies transformer layers to model feature interactions\.
#### III\-C1 Feature Tokenization
Each input feature $x_{j}$ is embedded into a $d$\-dimensional vector:
Numerical Features: For numerical feature $x_{j}\in\mathbb{R}$:
$$\mathbf{e}_{j}=x_{j}\cdot\mathbf{w}_{j}+\mathbf{b}_{j} \tag{5}$$
where $\mathbf{w}_{j}\in\mathbb{R}^{d}$ and $\mathbf{b}_{j}\in\mathbb{R}^{d}$ are learnable parameters\. This linear embedding allows the model to learn feature\-specific scaling and shifting\.
Categorical Features: For categorical feature $x_{j}\in\{1,\ldots,C_{j}\}$:
$$\mathbf{e}_{j}=\mathbf{E}_{j}[x_{j}] \tag{6}$$
where $\mathbf{E}_{j}\in\mathbb{R}^{C_{j}\times d}$ is a learnable embedding matrix\.
This produces an embedding matrix $\mathbf{E}=[\mathbf{e}_{1},\mathbf{e}_{2},\ldots,\mathbf{e}_{m}]\in\mathbb{R}^{m\times d}$\.
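The tokenization in Eqs. (5) and (6) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: parameters are randomly initialized rather than learned, category indices are 0-based (the text uses 1-based), and the names `make_tokenizer` and `tokenize` as well as the example record are hypothetical.

```python
import random

def make_tokenizer(num_features, cat_cardinalities, d, seed=0):
    """Randomly initialised parameters for Eqs. (5)-(6); training is omitted."""
    rng = random.Random(seed)
    vec = lambda: [rng.gauss(0.0, 0.1) for _ in range(d)]
    return {
        "w": [vec() for _ in range(num_features)],   # w_j per numerical feature
        "b": [vec() for _ in range(num_features)],   # b_j per numerical feature
        "E": [[vec() for _ in range(c)] for c in cat_cardinalities],  # E_j tables
    }

def tokenize(params, x_num, x_cat):
    """Map one record to an m x d list of token embeddings."""
    tokens = []
    for x, w, b in zip(x_num, params["w"], params["b"]):
        tokens.append([x * wi + bi for wi, bi in zip(w, b)])   # Eq. (5)
    for idx, table in zip(x_cat, params["E"]):
        tokens.append(list(table[idx]))                         # Eq. (6), 0-based index
    return tokens

# Hypothetical record: 2 numerical features, 2 categorical features, d = 4.
params = make_tokenizer(num_features=2, cat_cardinalities=[3, 2], d=4)
tokens = tokenize(params, x_num=[0.5, -1.2], x_cat=[2, 0])
print(len(tokens), len(tokens[0]))  # 4 4
```

Each feature, numerical or categorical, ends up as one row of the same width $d$, which is what allows a single attention stack to operate over both feature types.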
#### III\-C2 Classification Token
A learnable \[CLS\] token $\mathbf{e}_{CLS}\in\mathbb{R}^{d}$ is prepended to aggregate information:
$$\mathbf{E}^{\prime}=[\mathbf{e}_{CLS},\mathbf{e}_{1},\ldots,\mathbf{e}_{m}]\in\mathbb{R}^{(m+1)\times d} \tag{7}$$
#### III\-C3 Multi\-Head Self\-Attention
The multi\-head self\-attention mechanism is the core of the transformer architecture\. For input $\mathbf{Z}\in\mathbb{R}^{n\times d}$ \(where $n=m+1$\), queries, keys, and values are computed as:
$$\mathbf{Q}=\mathbf{Z}\mathbf{W}^{Q},\quad\mathbf{K}=\mathbf{Z}\mathbf{W}^{K},\quad\mathbf{V}=\mathbf{Z}\mathbf{W}^{V} \tag{8}$$
where $\mathbf{W}^{Q},\mathbf{W}^{K},\mathbf{W}^{V}\in\mathbb{R}^{d\times d_{k}}$\.
Attention weights are computed via scaled dot\-product:
$$\mathbf{A}=\text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{k}}}\right)\in\mathbb{R}^{n\times n} \tag{9}$$
The scaling factor $\sqrt{d_{k}}$ prevents attention weights from becoming too peaked when $d_{k}$ is large, which would cause vanishing gradients through the softmax\.
The attention output is:
$$\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathbf{A}\mathbf{V} \tag{10}$$
For multi\-head attention with $H$ heads:
$$\text{MultiHead}(\mathbf{Z})=\text{Concat}(\text{head}_{1},\ldots,\text{head}_{H})\mathbf{W}^{O} \tag{11}$$
where $\text{head}_{h}=\text{Attention}(\mathbf{Z}\mathbf{W}^{Q}_{h},\mathbf{Z}\mathbf{W}^{K}_{h},\mathbf{Z}\mathbf{W}^{V}_{h})$ and $\mathbf{W}^{O}\in\mathbb{R}^{H\cdot d_{k}\times d}$\.
Interpretation: The attention matrix $\mathbf{A}$ captures pairwise feature interactions\. Entry $A_{ij}$ represents how much feature $i$ “attends to” feature $j$\. For churn prediction, the model might learn to attend strongly to IsActiveMember for high\-balance customers, while CreditScore receives more attention for low\-balance customers\.
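A single attention head from Eqs. (8) to (10) can be written in a few lines of dependency-free Python. The sketch below is illustrative only: it uses identity projection matrices and a toy three-token input (both assumptions made here), and omits the multi-head concatenation of Eq. (11).

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(z, w_q, w_k, w_v):
    """Single-head scaled dot-product attention, Eqs. (8)-(10)."""
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    d_k = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]          # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    a = softmax_rows(scores)                      # n x n attention matrix A
    return matmul(a, v), a

# Toy input: n = 3 tokens ([CLS] + 2 features), d = d_k = 2, identity projections.
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
out, a = attention_head(z, eye, eye, eye)
for row in a:
    print([round(v, 3) for v in row])  # each row of A sums to 1
```

Row $i$ of `a` is the distribution over tokens that token $i$ attends to, which is the quantity the interpretation paragraph above refers to.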
#### III\-C4 Transformer Layer
Each of $L$ transformer layers applies:
$$\mathbf{Z}^{\prime}=\text{LayerNorm}(\mathbf{Z}+\text{MultiHead}(\mathbf{Z})) \tag{12}$$
$$\mathbf{Z}^{\prime\prime}=\text{LayerNorm}(\mathbf{Z}^{\prime}+\text{FFN}(\mathbf{Z}^{\prime})) \tag{13}$$
where LayerNorm\[ba2016layernorm\] normalizes each token's activations across the embedding dimension\.
The feed\-forward network is:
$$\text{FFN}(\mathbf{z})=\text{GELU}(\mathbf{z}\mathbf{W}_{1}+\mathbf{b}_{1})\mathbf{W}_{2}+\mathbf{b}_{2} \tag{14}$$
with $\mathbf{W}_{1}\in\mathbb{R}^{d\times 4d}$, $\mathbf{W}_{2}\in\mathbb{R}^{4d\times d}$ \(expansion factor of 4\), where GELU is the Gaussian Error Linear Unit activation function\[hendrycks2016gelu\]\.
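The feed-forward sub-layer of Eqs. (13) and (14) can be sketched for a single token vector as below. Several simplifications are assumed here and are not from the paper: LayerNorm is written without its learnable gain and bias, the stabilizing constant `eps` is an illustrative value, GELU uses the exact erf form, and the constant weights are placeholders chosen only so the result is easy to verify by hand.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalise one token vector to zero mean, unit variance (no learned scale/shift)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def ffn(z, w1, b1, w2, b2):
    """Position-wise feed-forward network, Eq. (14), for one token vector z."""
    hidden = [gelu(sum(zi * w1[i][j] for i, zi in enumerate(z)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * w2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

def ffn_sublayer(z, w1, b1, w2, b2):
    """Residual connection followed by LayerNorm, Eq. (13)."""
    f = ffn(z, w1, b1, w2, b2)
    return layer_norm([zi + fi for zi, fi in zip(z, f)])

# Toy dimensions: d = 2, hidden width 4d = 8, constant placeholder weights.
d = 2
w1 = [[0.1] * (4 * d) for _ in range(d)]
b1 = [0.0] * (4 * d)
w2 = [[0.1] * d for _ in range(4 * d)]
b2 = [0.0] * d
out = ffn_sublayer([1.0, -0.5], w1, b1, w2, b2)
print([round(v, 4) for v in out])
```

Because the placeholder FFN adds the same constant to both coordinates, the normalized output depends only on the input's deviation from its mean, which makes the residual-plus-normalization step easy to inspect.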
#### III\-C5 Classification Head
After $L$ layers, the \[CLS\] token representation $\mathbf{h}_{CLS}\in\mathbb{R}^{d}$ is passed through a classification head:
$$\hat{p}_{FT}=\sigma(\mathbf{w}_{out}^{T}\mathbf{h}_{CLS}+b_{out}) \tag{15}$$
where $\sigma$ is the sigmoid function\.
### III\-D XGBoost Gradient Boosting
XGBoost\[chen2016xgboost\] builds an ensemble of $T$ regression trees:
$$\hat{s}=\sum_{t=1}^{T}f_{t}(\mathbf{x}),\quad f_{t}\in\mathcal{F} \tag{16}$$
where $\mathcal{F}$ is the space of regression trees\.
#### III\-D1 Objective Function
The regularized objective at iteration $t$ is:
$$\mathcal{L}^{(t)}=\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}_{t-1}+f_{t}(\mathbf{x}^{(i)}))+\Omega(f_{t}) \tag{17}$$
where $\ell$ denotes the loss function \(logistic loss in this classification setting\) and $\Omega$ represents the regularization component:
$$\Omega(f)=\gamma J+\frac{1}{2}\lambda\sum_{j=1}^{J}w_{j}^{2} \tag{18}$$
Here, $J$ indicates the total number of leaves, $w_{j}$ denotes the weight of leaf $j$, $\gamma$ controls the tree complexity, and $\lambda$ applies L2 regularization to the leaf weights\.
#### III\-D2 Second\-Order Approximation
XGBoost uses a second\-order Taylor expansion of the loss around the current prediction\. Dropping terms that do not depend on $f_{t}$ gives:
$$\mathcal{L}^{(t)}\approx\sum_{i=1}^{N}\left[g_{i}f_{t}(\mathbf{x}^{(i)})+\frac{1}{2}h_{i}f_{t}(\mathbf{x}^{(i)})^{2}\right]+\Omega(f_{t}) \tag{19}$$
where $g_{i}=\partial_{\hat{y}}\ell(y^{(i)},\hat{y}^{(i)}_{t-1})$ and $h_{i}=\partial^{2}_{\hat{y}}\ell(y^{(i)},\hat{y}^{(i)}_{t-1})$ are the first and second derivatives of the loss with respect to the current prediction\.
For binary classification with logistic loss, where $\hat{p}_{i}=\sigma(\hat{y}^{(i)}_{t-1})$ is the predicted probability after $t-1$ trees:
$$g_{i}=\hat{p}_{i}-y_{i},\quad h_{i}=\hat{p}_{i}(1-\hat{p}_{i}) \tag{20}$$
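The closed forms in Eq. (20) can be verified numerically. The sketch below compares them against central finite differences of the logistic loss, treating the raw score (log-odds) as the variable of differentiation; the evaluation point and step size are arbitrary choices made for this illustration.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def logistic_loss(y, s):
    """Binary log loss as a function of the raw score s (log-odds)."""
    p = sigmoid(s)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def grad_hess(y, s):
    """Closed-form derivatives from Eq. (20), taken with respect to s."""
    p = sigmoid(s)
    return p - y, p * (1.0 - p)

# Compare against central finite differences at an arbitrary point.
y, s, step = 1, 0.4, 1e-4
g, hess = grad_hess(y, s)
g_fd = (logistic_loss(y, s + step) - logistic_loss(y, s - step)) / (2 * step)
hess_fd = (logistic_loss(y, s + step) - 2 * logistic_loss(y, s)
           + logistic_loss(y, s - step)) / step ** 2
print(abs(g - g_fd) < 1e-5, abs(hess - hess_fd) < 1e-5)
```

Note that the Hessian $\hat{p}_{i}(1-\hat{p}_{i})$ does not depend on the label, so it shrinks toward zero for any sample the model is already confident about.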
#### III\-D3 Optimal Leaf Weights
For a fixed tree structure, the optimal weight for leaf $j$ containing samples $I_{j}$ \(the index set of samples assigned to leaf $j$\) is:
$$w_{j}^{*}=-\frac{\sum_{i\in I_{j}}g_{i}}{\sum_{i\in I_{j}}h_{i}+\lambda},\quad\lambda>0 \tag{21}$$
where the constraint $\lambda>0$ ensures numerical stability and prevents division by zero\.
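As a worked instance of Eq. (21), the sketch below computes the optimal weight for one hypothetical leaf. The three samples, their shared starting probability of 0.5, and the value `lam=1.0` are illustrative assumptions, not settings reported in the paper.

```python
def optimal_leaf_weight(grads, hessians, lam=1.0):
    """Eq. (21): w* = -sum(g) / (sum(h) + lambda) for the samples in one leaf."""
    if lam <= 0:
        raise ValueError("lambda must be positive")
    return -sum(grads) / (sum(hessians) + lam)

# Hypothetical leaf with three samples, all currently predicted at p = 0.5:
# two churners (g = 0.5 - 1 = -0.5) and one retained customer (g = 0.5 - 0 = 0.5).
grads = [-0.5, -0.5, 0.5]
hessians = [0.25, 0.25, 0.25]      # h = p * (1 - p) = 0.25
w = optimal_leaf_weight(grads, hessians, lam=1.0)
print(round(w, 4))
```

The weight is positive because churners outnumber retained customers in the leaf, so the raw score is pushed toward churn; a larger $\lambda$ shrinks the weight toward zero.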
#### III\-D4 Class Imbalance Handling
XGBoost handles class imbalance through the scale\_pos\_weight parameter, which scales the gradient for positive samples:
$$g_{i}^{scaled}=\begin{cases}\text{scale\_pos\_weight}\cdot g_{i}&\text{if }y_{i}=1\\ g_{i}&\text{if }y_{i}=0\end{cases} \tag{22}$$
scale\_pos\_weight is assigned as $N_{-}/N_{+}\approx 3.9$\.
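The value quoted above and the scaling rule of Eq. (22) can be reproduced as follows. The helper names are hypothetical; the sketch mirrors only what Eq. (22) states, and whether the Hessian is scaled as well is not specified in the text, so it is left out here.

```python
def scale_pos_weight(n_pos, n_neg):
    """Ratio of negative to positive samples, as stated in the text."""
    return n_neg / n_pos

def scaled_gradient(g, y, spw):
    """Eq. (22): multiply the gradient of positive samples by spw."""
    return spw * g if y == 1 else g

# Class counts reported for the bank churn dataset.
spw = scale_pos_weight(n_pos=2037, n_neg=7963)
print(round(spw, 2))  # 3.91

g_pos = scaled_gradient(-0.5, 1, spw)   # churner: gradient is amplified
g_neg = scaled_gradient(0.5, 0, spw)    # retained customer: gradient unchanged
```

This ratio equals $w_{+}/w_{-}$ from Eq. (3), so both base models upweight churners by the same relative factor.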
### III\-E Stacking Ensemble Theory
#### III\-E1 Bias\-Variance Decomposition
For a single model, expected prediction error decomposes as:
$$\mathbb{E}[(y-\hat{f}(\mathbf{x}))^{2}]=\text{Bias}^{2}+\text{Variance}+\text{Noise} \tag{23}$$
Averaging reduces variance when the errors of the base models are not perfectly correlated\. For $M$ models with equal variance $\sigma^{2}$ and pairwise correlation $\rho$:
$$\text{Var}_{ensemble}=\frac{1}{M}\sigma^{2}+\frac{M-1}{M}\rho\sigma^{2} \tag{24}$$
When $\rho<1$ \(diverse models\), ensemble variance is lower than individual variance\.
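Eq. (24) is easy to evaluate for the two-model case. The sketch below plugs in unit variance and $\rho=0.62$, assuming that the correlation reported in the contributions plays the role of the pairwise correlation in Eq. (24); that mapping is an assumption made for illustration.

```python
def ensemble_variance(sigma2, rho, m):
    """Eq. (24): variance of the average of m equally correlated models."""
    return sigma2 / m + (m - 1) / m * rho * sigma2

# Two base models with unit variance and an assumed pairwise correlation of 0.62.
v = ensemble_variance(sigma2=1.0, rho=0.62, m=2)
print(round(v, 2))  # 0.81

# Limiting cases: perfectly correlated models give no reduction,
# uncorrelated models give the familiar 1/M reduction.
v_same = ensemble_variance(1.0, 1.0, 2)
v_indep = ensemble_variance(1.0, 0.0, 2)
```

Under these assumptions, averaging the two models would remove roughly 19% of the variance of either model alone, well short of the 50% that fully uncorrelated models would give.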
#### III-E2 Stacking Formulation
Given base model predictions $\hat{p}_{1}, \ldots, \hat{p}_{M}$, the stacking meta-learner learns:
$$\hat{p}_{stack} = g(\hat{p}_{1}, \ldots, \hat{p}_{M}; \boldsymbol{\theta}) \tag{25}$$
where $g$ is the meta-learner function and $\boldsymbol{\theta}$ denotes its learnable parameters.
For a logistic regression meta-learner:
$$\hat{p}_{stack} = \sigma\left(w_{0} + \sum_{k=1}^{M} w_{k}\hat{p}_{k}\right) \tag{26}$$
The learned weights $w_{k}$ indicate the relative contribution of each base model. For $M = 2$ (FT-Transformer and XGBoost):
$$\hat{p}_{stack} = \sigma(w_{0} + w_{1}\hat{p}_{FT} + w_{2}\hat{p}_{XGB}) \tag{27}$$
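Eq. (27) can be illustrated with a few lines of Python. This is a minimal sketch: the coefficient values below are hypothetical and chosen only so that the arithmetic is easy to follow; they are not the fitted values from the experiments.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function used by the meta-learner."""
    return 1.0 / (1.0 + math.exp(-z))


def stack_predict(p_ft: float, p_xgb: float, w0: float, w1: float, w2: float) -> float:
    """Combine two base-model probabilities as in Eq. (27)."""
    return sigmoid(w0 + w1 * p_ft + w2 * p_xgb)


# Hypothetical coefficients (illustration only)
W0, W1, W2 = -2.0, 2.0, 2.0

p_mid = stack_predict(0.5, 0.5, W0, W1, W2)    # logit = 0
p_high = stack_predict(0.9, 0.9, W0, W1, W2)   # both models predict churn
p_low = stack_predict(0.1, 0.1, W0, W1, W2)    # both models predict retention
```

Because the combination is monotone in each input when the weights are positive, agreement between the base models pushes the stacked probability further from 0.5 than either would under a negative intercept alone.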
#### III-E3 Out-of-Fold Prediction
To prevent information leakage, $K$-fold cross-validation is used to generate meta-features. For fold $k$:
1. Train base models on folds $\{1, \ldots, K\} \setminus \{k\}$
2. Generate predictions for fold $k$ samples
This produces out-of-fold predictions $\mathbf{P}_{OOF} \in \mathbb{R}^{N \times M}$ where each sample's prediction comes from a model that did not see it during training.
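The two steps above can be sketched with NumPy alone. This is an illustration under simplifying assumptions: folds are random rather than stratified, and the two "models" are trivial stand-ins (a base-rate predictor and a single-threshold rule), not FT-Transformer or XGBoost. All function names are hypothetical.

```python
import numpy as np


def oof_predictions(X, y, fit_predict_fns, n_folds=5, seed=0):
    """Return an (N, M) matrix of out-of-fold predictions.

    fit_predict_fns: list of callables f(X_train, y_train, X_val) -> probabilities.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(n, dtype=int)
    fold_of[rng.permutation(n)] = np.arange(n) % n_folds  # random, balanced fold ids
    oof = np.full((n, len(fit_predict_fns)), np.nan)
    for k in range(n_folds):
        val = fold_of == k
        train = ~val
        for m, fn in enumerate(fit_predict_fns):
            # Rows of fold k are predicted by a model fitted on the other folds only
            oof[val, m] = fn(X[train], y[train], X[val])
    return oof


def base_rate_model(X_train, y_train, X_val):
    """Stand-in model: predicts the training positive rate for every row."""
    return np.full(len(X_val), y_train.mean())


def threshold_model(X_train, y_train, X_val):
    """Stand-in model: positive rate on each side of the median of feature 0."""
    cut = np.median(X_train[:, 0])
    hi_rate = y_train[X_train[:, 0] > cut].mean()
    lo_rate = y_train[X_train[:, 0] <= cut].mean()
    return np.where(X_val[:, 0] > cut, hi_rate, lo_rate)


rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 0.8).astype(float)
oof = oof_predictions(X, y, [base_rate_model, threshold_model])
```

Initializing the matrix with NaN makes it easy to verify that every row received exactly one held-out prediction per model.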
Consistent notational conventions are adopted throughout the paper\. All notation, including symbols and conventions, is summarized in Appendix[A](https://arxiv.org/html/2606.07582#A1)\.
## IV Dataset Description
### IV-A Data Source and Overview
The framework is evaluated on the Bank Customer Churn dataset [kaggle2018], a public benchmark containing 10,000 customer entries from a European banking institution. This dataset is suitable for transformer evaluation due to its moderate size, realistic class imbalance (20.4% churn), and established use in churn prediction research. Each record contains demographic information, account characteristics, and engagement indicators, along with a binary target indicating whether the customer churned within the observation period.
### IV-B Feature Description
Table [II](https://arxiv.org/html/2606.07582#S4.T2) lists the dataset features with descriptive statistics. The feature composition is comparable to datasets used in prior churn studies [burez2009,xu2021], enabling alignment with existing methodologies.
TABLE II: Dataset Features with Descriptive Statistics
### IV-C Class Distribution and Imbalance
The dataset exhibits class imbalance typical of churn scenarios:
- •Non\-churners \(Retained\): 7,963 customers \(79\.6%\)
- •Churners \(Exited\): 2,037 customers \(20\.4%\)
The approximately 4:1 imbalance ratio requires using class\-weighted loss functions rather than relying solely on naive accuracy optimization\.
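The class counts above determine both the XGBoost scale\_pos\_weight ratio and the balanced class weights $N/(2N_{+})$ and $N/(2N_{-})$ used for the FT-Transformer loss. The sketch below works through the arithmetic on the full-dataset counts (per-fold training counts will differ slightly) and shows one common form of class-weighted binary cross-entropy; the exact loss definition used in the experiments is an assumption here.

```python
import math

N_NEG, N_POS = 7963, 2037          # retained and churned customers
N = N_NEG + N_POS

scale_pos_weight = N_NEG / N_POS   # ratio used for XGBoost
w_pos = N / (2 * N_POS)            # balanced weight for churners
w_neg = N / (2 * N_NEG)            # balanced weight for non-churners


def weighted_bce(p, y, w_pos, w_neg, eps=1e-12):
    """Mean class-weighted binary cross-entropy (one common form, assumed here)."""
    total = 0.0
    for p_i, y_i in zip(p, y):
        p_i = min(max(p_i, eps), 1 - eps)   # avoid log(0)
        total -= w_pos * y_i * math.log(p_i) + w_neg * (1 - y_i) * math.log(1 - p_i)
    return total / len(y)


# The same confidence error costs more on a churner than on a non-churner
loss_missed_churner = weighted_bce([0.1], [1], w_pos, w_neg)
loss_false_alarm = weighted_bce([0.9], [0], w_pos, w_neg)
```

Note that $w_{+}/w_{-}$ equals $N_{-}/N_{+}$, so both weighting schemes encode the same relative emphasis on the minority class.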
### IV-D Feature Distributions by Class
Exploratory analysis reveals notable differences between churners and non\-churners:
- •Age: Churners are older on average \(45 vs 37 years\)
- •Balance: Churners have higher average balance \($91K vs $72K\)
- •NumOfProducts: Churners more likely to have 3\-4 products
- •IsActiveMember: Churners less likely to be active \(36% vs 55%\)
- •Geography: Churn is highest in Germany \(32%\) compared to France \(16%\) and Spain \(17%\)
### IV-E Cross-Domain Feature Mapping
Table [III](https://arxiv.org/html/2606.07582#S4.T3) illustrates the correspondence between banking features and analogous features in other industries. This mapping reinforces the framework's domain-agnostic design, as feature types such as demographic attributes, engagement metrics, financial indicators, and product holdings are conceptually similar across banking, telecommunications, e-commerce, and insurance. Only domain-specific value ranges and encodings differ.
TABLE III: Cross-Domain Feature Mapping
### IV-F Data Quality and Limitations
The dataset [kaggle2018] is publicly available but lacks provenance details, including the source institution and sampling methodology. It represents a single snapshot from a European bank, likely between 2010–2015, with no temporal progression or behavioral trends. Geographic scope is limited to France, Germany, and Spain, and no missing values are present, suggesting prior preprocessing.
Despite these constraints, the dataset is widely used as a benchmark for churn prediction on tabular data and provides a representative class imbalance \(20\.4%\), making it suitable for method comparison and reproducibility\.
## V Methodology
### V-A Overall Architecture
Figure [1](https://arxiv.org/html/2606.07582#S5.F1) illustrates the end-to-end framework. Raw customer data passes through preprocessing (imputation, normalization, and encoding) before being forwarded to two independent base models, FT-Transformer and XGBoost. Out-of-fold predictions from both models form the inputs to a logistic regression meta-learner, which outputs the final churn probability [wolpert1992].
Figure 1: Overall architecture of the FT-Transformer + XGBoost stacking ensemble framework.
### V-B Data Preprocessing Pipeline
Algorithm [1](https://arxiv.org/html/2606.07582#alg1) details the preprocessing pipeline.
Algorithm 1: Data Preprocessing Pipeline
1: Input: Raw dataset $\mathcal{D}_{raw}$
2: Output: Preprocessed dataset $\mathcal{D}$
3: // Step 1: Data Cleaning
4: Remove identifier columns: RowNumber, CustomerId, Surname
5: // Step 2: Missing Value Imputation (fit on train only)
6: for each numerical feature $x_{j}$ do
7:   $x_{j}^{\text{missing}} \leftarrow \mathrm{median}(x_{j}^{\text{train}})$ ⊳ Compute from train only
8: end for
9: for each categorical feature $x_{j}$ do
10:   $x_{j}^{\text{missing}} \leftarrow \mathrm{mode}(x_{j}^{\text{train}})$ ⊳ Compute from train only
11: end for
12: // Step 3: Feature Encoding (fit on train only)
13: for each numerical feature $x_{j}$ do
14:   $\mu_{j}, \sigma_{j} \leftarrow \mathrm{mean}(x_{j}^{\text{train}}), \mathrm{std}(x_{j}^{\text{train}})$
15:   $x_{j} \leftarrow (x_{j} - \mu_{j}) / \sigma_{j}$ ⊳ Z-score normalization
16: end for
17: for each categorical feature $x_{j}$ do
18:   Apply one-hot encoding
19: end for
20: // Step 4: Output
21: Geography $\rightarrow$ 3 binary columns (France, Germany, Spain)
22: Gender $\rightarrow$ 1 binary column (IsMale)
23: return $\mathcal{D}$ with 12 features
Key design decisions are as follows:
- •Z-score vs Min-Max: Z-score normalization is used instead of min-max scaling, which is more sensitive to extreme values in Balance and EstimatedSalary because a single outlier sets the scaling range.
- •Fit on Train Only: Normalization parameters ($\mu$, $\sigma$) are derived solely from the training split to avoid data leakage.
- •Imputation from Train Only: Imputation statistics (median for numerical features, mode for categorical features) are computed exclusively from the training split and applied to both the training and test sets to prevent data leakage.
- •No Oversampling: SMOTE and random oversampling are avoided due to the risk of synthetic artifacts\. A class\-weighted loss is used instead\.
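The "fit on train only" rule can be sketched with NumPy as follows. This is a minimal illustration with hypothetical helper names and toy values; it covers median imputation, z-score scaling, and one-hot encoding with a category list learned from training data, and omits mode imputation for categorical features.

```python
import numpy as np


def fit_numeric_stats(x_train):
    """Learn median, mean, and std per column from the training split only."""
    median = np.nanmedian(x_train, axis=0)
    filled = np.where(np.isnan(x_train), median, x_train)
    std = filled.std(axis=0)
    return {
        "median": median,
        "mean": filled.mean(axis=0),
        "std": np.where(std == 0, 1.0, std),  # guard against constant columns
    }


def transform_numeric(x, stats):
    """Impute and z-score any split using statistics learned on the training split."""
    filled = np.where(np.isnan(x), stats["median"], x)
    return (filled - stats["mean"]) / stats["std"]


def one_hot(values, categories):
    """One-hot encode using the category list observed in the training split."""
    return np.array([[float(v == c) for c in categories] for v in values])


x_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
x_test = np.array([[np.nan, 50.0]])

stats = fit_numeric_stats(x_train)           # fitted once, on training rows only
z_train = transform_numeric(x_train, stats)
z_test = transform_numeric(x_test, stats)    # test rows reuse the training statistics
geo = one_hot(["Germany", "France"], ["France", "Germany", "Spain"])
```

The missing test value is filled with the training median, so it maps to a z-score of zero in this toy example, and the test value 50 lands well outside the training range rather than being rescaled by its own statistics.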
### V-C FT-Transformer Training
Algorithm [2](https://arxiv.org/html/2606.07582#alg2) describes the FT-Transformer training procedure with early stopping.
Algorithm 2: FT-Transformer Training with Early Stopping
1: Input: Training data $\mathcal{D}_{train}$, validation data $\mathcal{D}_{val}$
2: Hyperparameters: $L = 4$ layers, $H = 8$ heads, embedding dimension $d = 32$
3: Output: Trained FT-Transformer model $f_{FT}$
4: Initialize numerical feature embeddings $\{\mathbf{W}_{j}, \mathbf{b}_{j}\}$
5: Initialize categorical feature embeddings $\{\mathbf{E}_{j}\}$
6: Initialize learnable [CLS] token embedding $\mathbf{e}_{CLS}$
7: Initialize $L$ transformer layers with $H$ attention heads
8: Compute class weights on training split: $w_{+} \leftarrow \frac{N_{train}}{2N_{+}^{train}}$, $w_{-} \leftarrow \frac{N_{train}}{2N_{-}^{train}}$ ⊳ Computed once per fold, fixed during training
9: $\mathit{best\_val\_loss} \leftarrow \infty$, $\mathit{patience} \leftarrow 0$
10: for $\mathit{epoch} = 1$ to $E_{max}$ do
11:   for each mini-batch $\mathcal{B} \subset \mathcal{D}_{train}$ do
12:     Forward pass:
13:       Tokenize batch features with a [CLS] prefix
14:       for $l = 1$ to $L$ do
15:         $\mathbf{E} \leftarrow \text{TransformerLayer}_{l}(\mathbf{E})$
16:       end for
17:       Extract $\mathbf{h}_{CLS}$ from final layer
18:       $\hat{\mathbf{p}} \leftarrow \sigma(\mathbf{W}_{out}\mathbf{h}_{CLS} + b_{out})$
19:     Compute loss: $\mathcal{L} \leftarrow \text{WeightedBCE}(\hat{\mathbf{p}}, \mathbf{y}, w_{+}, w_{-})$
20:     Backward pass:
21:       Compute gradients $\nabla_{\theta}\mathcal{L}$
22:       Update parameters using Adam optimizer [kingma2014adam]
23:   end for
24:   $\mathit{val\_loss} \leftarrow \text{Evaluate}(f_{FT}, \mathcal{D}_{val})$
25:   if $\mathit{val\_loss} < \mathit{best\_val\_loss}$ then
26:     $\mathit{best\_val\_loss} \leftarrow \mathit{val\_loss}$
27:     Save model checkpoint
28:     $\mathit{patience} \leftarrow 0$
29:   else
30:     $\mathit{patience} \leftarrow \mathit{patience} + 1$
31:   end if
32:   if $\mathit{patience} \geq P$ then
33:     break ⊳ Early stopping
34:   end if
35: end for
36: Load best model checkpoint
return $f_{FT}$
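The early-stopping rule in Algorithm 2 can be replayed over a list of per-epoch validation losses without training a model. The sketch below uses a synthetic loss curve (an assumption for illustration) that improves through epoch 35 and then plateaus, mirroring the behavior reported for the learning curves; the function name is hypothetical.

```python
def run_early_stopping(val_losses, patience_limit):
    """Replay the early-stopping rule over per-epoch validation losses.

    Returns (best_epoch, stop_epoch), both 1-indexed.
    """
    best_val_loss = float("inf")
    best_epoch = 0
    stop_epoch = 0
    patience = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        stop_epoch = epoch
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_epoch = epoch      # the checkpoint would be saved here
            patience = 0
        else:
            patience += 1
        if patience >= patience_limit:
            break                   # early stopping
    return best_epoch, stop_epoch


# Synthetic curve: strictly improving for 35 epochs, then flat at a worse value
losses = [1.0 / e for e in range(1, 36)] + [0.05] * 65
best_epoch, stop_epoch = run_early_stopping(losses, patience_limit=10)
```

With a patience of 10, the loop stops ten epochs after the last improvement, and the checkpoint restored afterwards is the one from the best epoch rather than the final one.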
### V-D Embedding Dimension Selection
The embedding dimension $d = 32$ was determined via grid search on validation data. Table [XI](https://arxiv.org/html/2606.07582#S7.T11) presents performance across $d \in \{16, 32, 64, 128\}$. Although $d = 32$ is optimal for the current 10-feature dataset, the appropriate embedding dimension should be adjusted according to the specific characteristics of each dataset [gorishniy2021,hollmann2022tabpfn].
A practical heuristic suggests $d \approx \lceil 4\sqrt{m} \rceil$ for $m$ features, resulting in $d = 13$ when $m = 10$. However, empirical tuning identified $d = 32$ as superior, suggesting that the model benefits from increased representational capacity [gorishniy2021,hollmann2022tabpfn]. This may reflect complex interactions among Age, Balance, and IsActiveMember in churn prediction.
For datasets with high-cardinality categorical features, adaptive per-feature embedding dimensions can improve scalability. The strategy allocates embedding capacity proportional to categorical cardinality and information content. Features with low cardinality (2-5 categories) are assigned smaller embeddings ($d = 8$), whereas features with high cardinality (>20 categories) are assigned larger embeddings ($d = 32$). Recent research on TabPFN [hollmann2022tabpfn] demonstrates meta-learning methods that dynamically select architecture parameters based on dataset properties.
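Both rules of thumb above are simple enough to express directly. In the sketch below the function names are hypothetical, and the value returned for mid-range cardinalities (6 to 20 categories) is an assumption, since the text specifies only the low and high ranges.

```python
import math


def heuristic_embedding_dim(num_features: int) -> int:
    """Heuristic d = ceil(4 * sqrt(m)) for m input features."""
    return math.ceil(4 * math.sqrt(num_features))


def per_feature_embedding_dim(cardinality: int) -> int:
    """Adaptive embedding size for one categorical feature."""
    if cardinality <= 5:
        return 8       # low cardinality (2-5 categories)
    if cardinality > 20:
        return 32      # high cardinality (>20 categories)
    return 16          # assumption: mid range is not specified in the text


d_heuristic = heuristic_embedding_dim(10)       # the 10-feature bank dataset
d_geography = per_feature_embedding_dim(3)      # France, Germany, Spain
```

For the bank dataset the heuristic gives 13, well below the tuned value of 32, which is why it is best treated as a starting point for a grid search rather than a final setting.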
### V-E XGBoost Training
XGBoost is configured as shown in Algorithm [3](https://arxiv.org/html/2606.07582#alg3). Key hyperparameters are selected as follows:
Algorithm 3: XGBoost Training with Early Stopping
1: Input: Training data $\mathcal{D}_{train}$, validation data $\mathcal{D}_{val}$
2: Output: Trained XGBoost model $f_{XGB}$
3: Set hyperparameters:
4:   Number of trees $T \leftarrow 300$
5:   Maximum tree depth $d_{\max} \leftarrow 6$
6:   Learning rate $\eta \leftarrow 0.05$
7:   Subsample ratio $\rho \leftarrow 0.8$
8:   Column subsample ratio $\rho_{c} \leftarrow 0.8$
9:   L2 regularization $\lambda \leftarrow 1.0$
10:   Class weight $\text{scale\_pos\_weight} \leftarrow N_{-}/N_{+}$
11:   Evaluation metric $\leftarrow$ log-loss
12:   Early stopping rounds $\leftarrow 50$
13: Initialize XGBoost with specified hyperparameters
14: Train model on $\mathcal{D}_{train}$ with validation on $\mathcal{D}_{val}$
15: Stop training if validation loss does not improve for 50 rounds
return $f_{XGB}$
- •max\_depth=6: Balances model complexity and overfitting; deeper trees \(8\+\) showed overfitting in preliminary experiments\.
- •learning\_rate=0\.05: Lower than default \(0\.3\) to allow more trees and smoother convergence\.
- •subsample=0\.8: Row subsampling reduces variance and prevents overfitting\.
- •scale\_pos\_weight: Set to the class ratio $N_{-}/N_{+}$ ($\approx 3.9$) to handle imbalance.
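The settings from Algorithm 3 can be collected into a parameter dictionary. The keys below follow the parameter names of XGBoost's scikit-learn wrapper (`XGBClassifier`); mapping "column subsample ratio" to `colsample_bytree` is an assumption, as is passing `eval_metric` and `early_stopping_rounds` through the constructor, which recent XGBoost releases accept but older ones expect in `fit()`. The helper name is hypothetical, and the block deliberately avoids importing XGBoost so it runs anywhere.

```python
def xgboost_params(n_neg: int, n_pos: int) -> dict:
    """Hyperparameters from Algorithm 3, keyed by XGBClassifier argument names."""
    return {
        "n_estimators": 300,            # number of trees T
        "max_depth": 6,                 # maximum tree depth
        "learning_rate": 0.05,          # eta
        "subsample": 0.8,               # row subsample ratio
        "colsample_bytree": 0.8,        # assumption: per-tree column subsampling
        "reg_lambda": 1.0,              # L2 regularization
        "scale_pos_weight": n_neg / n_pos,
        "eval_metric": "logloss",
        "early_stopping_rounds": 50,
    }


# Full-dataset class counts; inside cross-validation the fold counts would be used
params = xgboost_params(n_neg=7963, n_pos=2037)
```

A typical use would be `XGBClassifier(**params)` followed by `fit(X_train, y_train, eval_set=[(X_val, y_val)])`, so that early stopping monitors the validation log-loss.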
### V-F Stacking Ensemble with OOF Predictions
Algorithm [4](https://arxiv.org/html/2606.07582#alg4) describes the complete stacking procedure. Key design decisions are as follows:
- •Multiple Seeds: Predictions are averaged across 5 random seeds to minimize variance resulting from random initialization\.
- •Stratified Splits: Preserve class distribution in each fold\.
- •Logistic Meta\-Learner: Simple linear combination prevents overfitting with only 2 base models\.
Algorithm 4: Stacking Ensemble with OOF Predictions
1: Input: Training data $\mathcal{D}_{train}$, validation data $\mathcal{D}_{val}$
2: Output: Trained ensemble $(f_{FT}^{final}, f_{XGB}^{final}, g)$
3: Initialize OOF prediction vectors $\mathbf{P}_{FT} \leftarrow \mathbf{0}^{N}$, $\mathbf{P}_{XGB} \leftarrow \mathbf{0}^{N}$ ⊳ Zero initialization
4: for each seed $s \in \mathcal{S}$ do
5:   Set all random seeds for reproducibility:
6:     random.seed($s$), np.random.seed($s$)
7:     torch.manual\_seed($s$), torch.cuda.manual\_seed\_all($s$)
8:   Create stratified $K$-fold split using seed $s$
9:   for $k = 1$ to $K$ do
10:     Define training set $\mathcal{D}_{train}^{k}$ and validation set $\mathcal{D}_{val}^{k}$
11:     Compute normalization statistics $\mu, \sigma$ on $\mathcal{D}_{train}^{k}$ only
12:     Apply normalization to both $\mathcal{D}_{train}^{k}$ and $\mathcal{D}_{val}^{k}$ using these statistics
13:     Train base models on $\mathcal{D}_{train}^{k}$:
14:       $f_{FT}^{k,s} \leftarrow \text{TrainFTTransformer}(\mathcal{D}_{train}^{k})$
15:       ⊳ Train per Algorithm [2](https://arxiv.org/html/2606.07582#alg2): 4 layers, 8 heads, $d = 32$, Adam,
16:       ⊳ class-weighted BCE loss, early stopping (patience=10)
17:       $f_{XGB}^{k,s} \leftarrow \text{TrainXGBoost}(\mathcal{D}_{train}^{k})$
18:       ⊳ Train per Algorithm [3](https://arxiv.org/html/2606.07582#alg3): max\_depth=6, $\eta = 0.05$, 300 trees,
19:       ⊳ scale\_pos\_weight=3.9, early stopping (patience=50)
20:     Generate OOF predictions for $\mathcal{D}_{val}^{k}$
21:     for each sample $i \in \mathcal{D}_{val}^{k}$ do
22:       $\mathbf{P}_{FT}[i] \leftarrow \mathbf{P}_{FT}[i] + \frac{f_{FT}^{k,s}(\mathbf{x}^{(i)})}{|\mathcal{S}|}$ ⊳ Each sample is held out once per seed, so the seed average divides by $|\mathcal{S}|$
23:       $\mathbf{P}_{XGB}[i] \leftarrow \mathbf{P}_{XGB}[i] + \frac{f_{XGB}^{k,s}(\mathbf{x}^{(i)})}{|\mathcal{S}|}$
24:     end for
25:   end for
26: end for
27: Construct meta-feature matrix $\mathbf{Z} \leftarrow [\mathbf{P}_{FT}, \mathbf{P}_{XGB}]$
28: Train meta-learner $g$ using logistic regression with class-balanced weights
29: Train final base models on full dataset $\mathcal{D}$:
30:   $f_{FT}^{final} \leftarrow \text{TrainFTTransformer}(\mathcal{D})$
31:   ⊳ Same hyperparameters as Algorithm [2](https://arxiv.org/html/2606.07582#alg2)
32:   $f_{XGB}^{final} \leftarrow \text{TrainXGBoost}(\mathcal{D})$
33:   ⊳ Same hyperparameters as Algorithm [3](https://arxiv.org/html/2606.07582#alg3)
return $(f_{FT}^{final}, f_{XGB}^{final}, g)$
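A compact scikit-learn sketch of the seed-averaged stacking loop is shown below. It is an illustration under stated assumptions: the data are synthetic, a logistic regression and a shallow decision tree stand in for FT-Transformer and XGBoost, and per-fold normalization is omitted for brevity. Because each sample falls in exactly one validation fold per seed, the accumulated predictions are divided by the number of seeds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def seed_averaged_oof(X, y, make_models, seeds, n_folds=5):
    """Average out-of-fold probabilities over seeds; returns an (N, M) matrix."""
    oof = np.zeros((len(y), len(make_models)))
    for seed in seeds:
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        for train_idx, val_idx in skf.split(X, y):
            for m, make_model in enumerate(make_models):
                model = make_model(seed).fit(X[train_idx], y[train_idx])
                # One held-out prediction per sample per seed
                oof[val_idx, m] += model.predict_proba(X[val_idx])[:, 1] / len(seeds)
    return oof


make_models = [
    lambda s: LogisticRegression(max_iter=1000),                    # stand-in for FT-Transformer
    lambda s: DecisionTreeClassifier(max_depth=3, random_state=s),  # stand-in for XGBoost
]

# Synthetic imbalanced data (roughly 80/20), for illustration only
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0)

Z = seed_averaged_oof(X, y, make_models, seeds=[42, 123])
meta = LogisticRegression(C=1.0, class_weight="balanced").fit(Z, y)
p_stack = meta.predict_proba(Z)[:, 1]
```

The fitted `meta.coef_` and `meta.intercept_` play the roles of $(w_{1}, w_{2})$ and $w_{0}$ in the logistic meta-learner; at inference time the final base models trained on the full data would supply the two input probabilities.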
### V-G Hyperparameter Summary
Table [IV](https://arxiv.org/html/2606.07582#S5.T4) summarizes all hyperparameters.
TABLE IV: Complete Hyperparameter Specification

| Component | Parameter | Value |
| --- | --- | --- |
| FT-Transformer | Transformer layers | 4 |
| | Attention heads | 8 |
| | Embedding dimension | 32 |
| | FFN expansion factor | 4 |
| | Dropout rate | 0.1 |
| | Optimizer | Adam |
| | Learning rate | $1 \times 10^{-3}$ |
| | Batch size | 256 |
| | Early stopping patience | 10 epochs |
| XGBoost | Number of trees | 300 |
| | Max depth | 6 |
| | Learning rate | 0.05 |
| | Subsample | 0.8 |
| | Column subsample | 0.8 |
| | L2 regularization ($\lambda$) | 1.0 |
| | scale\_pos\_weight | $\sim$3.9 |
| | Early stopping rounds | 50 |
| LightGBM | num\_leaves | 31 |
| | learning\_rate | 0.05 |
| | feature\_fraction | 0.8 |
| | scale\_pos\_weight | 3.9 |
| CatBoost | depth | 6 |
| | learning\_rate | 0.05 |
| | l2\_leaf\_reg | 3.0 |
| | scale\_pos\_weight | 3.9 |
| Meta-Learner | Type | Logistic Regression |
| | Regularization | L2 ($C = 1.0$) |
| Cross-Validation | Folds | 5 |
| | Random seeds | 5 |
| | Stratification | Yes |
## VI Experimental Results
All experiments were conducted using standard open-source libraries with fixed random seeds to ensure reproducibility. Models are evaluated using established metrics for imbalanced classification, with formal definitions provided in Appendix [B](https://arxiv.org/html/2606.07582#A2).
### VI-A Baseline Comparison
Table [V](https://arxiv.org/html/2606.07582#S6.T5) presents a performance comparison across all models with 95% confidence intervals.
TABLE V: Performance Comparison Across Models (Mean ± Std with 95% CI)
Main Result: The stacked ensemble achieves F1 = 62.10% (95% CI: [61.65, 62.55]) and AUC = 0.861 (95% CI: [0.858, 0.864]), representing statistically significant improvements over all baselines with large effect sizes (see Table [VI](https://arxiv.org/html/2606.07582#S6.T6)).
### VI-B Statistical Significance Testing
Paired t-tests are performed comparing the stacked ensemble to each baseline across 25 evaluations (5-fold cross-validation (CV) × 5 random seeds), yielding 24 degrees of freedom.
TABLE VI: Statistical Significance Tests Comparing the Stacked Ensemble Against Baseline Models
All improvements are statistically significant at $p < 0.01$, with effect sizes ranging from medium-large ($d = 0.9$ for FT-Transformer) to very large ($d = 2.5$ to $2.8$ for tree-based baselines), indicating both statistical and practical significance. Because the 25 evaluations share training data across folds and seeds, these results should be interpreted as repeated-resampling comparisons rather than fully independent trials.
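The paired comparison can be sketched with the standard library. The 25 score pairs below are hypothetical placeholders, not the experimental results, and the effect size is computed as the paired-samples form (mean difference divided by the standard deviation of the differences), which is an assumption about how Cohen's d was defined.

```python
import math
import statistics


def paired_t_and_cohens_d(scores_a, scores_b):
    """Paired t statistic, degrees of freedom, and paired-samples Cohen's d."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)             # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    cohens_d = mean_diff / sd_diff
    return t_stat, n - 1, cohens_d


# Hypothetical F1 scores for 5 folds x 5 seeds = 25 paired evaluations
ensemble_f1 = [62.0 + 0.1 * (i % 5) for i in range(25)]
baseline_f1 = [59.0 + 0.2 * (i % 3) for i in range(25)]

t_stat, dof, d = paired_t_and_cohens_d(ensemble_f1, baseline_f1)
```

With 25 pairs, the t statistic is exactly five times this effect size, which shows why even moderate effects reach significance at this sample size.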
### VI-C Confusion Matrix Analysis
Figure [2](https://arxiv.org/html/2606.07582#S6.F2) illustrates the confusion matrix of the stacked ensemble, computed over 5-fold cross-validation.
Figure 2: Confusion matrix for the stacked ensemble aggregated over 5-fold cross-validation.
Recall, precision, specificity, and Negative Predictive Value (NPV) are computed as defined in Appendix [B](https://arxiv.org/html/2606.07582#A2). From the confusion matrix, the stacked model achieves 75.4% recall, 53.2% precision, 86.5% specificity, and 93.2% NPV.
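For reference, the four metrics follow directly from the confusion-matrix cells using their standard definitions. The counts in the sketch below are toy values chosen for easy arithmetic, not the counts from Figure 2.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard confusion-matrix metrics for a binary classifier."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "recall": recall,                       # share of churners that were caught
        "precision": precision,                 # share of churn alerts that were correct
        "specificity": tn / (tn + fp),          # share of non-churners left alone
        "npv": tn / (tn + fn),                  # share of "retained" calls that were correct
        "f1": 2 * precision * recall / (precision + recall),
    }


# Toy counts (illustration only)
m = confusion_metrics(tp=80, fp=40, tn=360, fn=20)
```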
### VI-D Learning Curves
Fig. [3](https://arxiv.org/html/2606.07582#S6.F3) illustrates the FT-Transformer's training and validation loss curves across epochs.
Figure 3: FT-Transformer learning curves show convergence around epoch 35 with early stopping triggered at epoch 45.
The model converges around epoch 35, with validation loss plateauing and early stopping triggered at epoch 45 (patience=10).
### VI-E Feature Importance Analysis
#### VI-E1 XGBoost SHAP Values
Fig. [4](https://arxiv.org/html/2606.07582#S6.F4) shows SHapley Additive exPlanations (SHAP) [lundberg2017] feature importance for the XGBoost component.
Figure 4: XGBoost SHAP feature importance ranked by mean absolute value. Age, IsActiveMember, and NumOfProducts emerge as the strongest predictors.
Key Findings:
- •Age: The strongest predictor \(SHAP = 0\.18\)\. Older customers show increased churn likelihood\.
- •IsActiveMember: Second strongest predictor \(SHAP = 0\.15\)\. Inactive customers are substantially more prone to churn\.
- •NumOfProducts: Exhibits a non\-monotonic trend: customers holding 1\-2 products show reduced risk, while those with 3\-4 products display elevated churn risk\.
- •Geography: Customers in Germany show higher churn probability relative to France and Spain\.
Directional Analysis:
- •Age: Positive SHAP values emerge for Age > 45, suggesting increased churn likelihood in older customers.
- •IsActiveMember: Active accounts yield negative SHAP contributions (protective), while inactive accounts yield positive values (risk-enhancing).
- •NumOfProducts: Neutral around 1-2 products. SHAP values exceed +0.10 once holdings reach 3+, consistent with over-extension or dissatisfaction.
- •Geography: Germany shows SHAP $\approx +0.08$ relative to France, potentially reflecting regional competition or service quality differentials.
- •Balance: Extreme balances (> $100K) yield slightly positive SHAP values, suggesting a weak positive association between high balance and churn risk.
#### VI-E2 FT-Transformer Attention Weight
The average attention weights from the final transformer layer indicate that the [CLS] token attends most strongly to the following features:
1. Age (attention weight = 0.21)
2. IsActiveMember (attention weight = 0.18)
3. NumOfProducts (attention weight = 0.14)
4. Balance (attention weight = 0.12)
Model Agreement: Both XGBoost (via SHAP values) and the FT-Transformer (via attention weights) identify Age and IsActiveMember as the most influential features. This cross-model consistency supports the validity of the feature importance results.
Figure 5: Feature-to-feature attention weights from the FT-Transformer's final layer. Darker cells indicate stronger attention. Top interactions are Age × NumOfProducts (0.21), IsActiveMember × Balance (0.21), and Age × Tenure (0.20). The transformer models these dependencies simultaneously, while tree-based models capture them sequentially.
#### VI-E3 Feature-to-Feature Attention Patterns
Figure [5](https://arxiv.org/html/2606.07582#S6.F5) presents the feature-to-feature attention weights from the final transformer layer, revealing pairwise feature dependencies.
Key Observations:
- •Age × NumOfProducts (0.21): Strong mutual attention confirms the interaction identified in Table [VII](https://arxiv.org/html/2606.07582#S6.T7). Churn risk for 3+ product holders increases sharply with age.
- •IsActiveMember × Balance (0.21): Account activity and balance are interdependent predictors. A high balance is protective only for active members.
- •Age × Tenure (0.20): Older customers with longer tenure exhibit distinct churn patterns compared to younger customers with similar tenure.
- •CreditScore × IsActiveMember (0.17): Active members with strong credit profiles exhibit lower churn, indicating financial stability as a retention factor.
- •Geography × Gender (0.17): Regional differences in gender-based churn align with market-specific product positioning.
Whereas XGBoost captures these interactions via recursive partitioning, transformer models capture all pairwise dependencies simultaneously. This parallel processing enables the detection of interaction patterns that may be obscured by the hierarchical structure of decision trees.
TABLE VII: Age × NumOfProducts Interaction: Churn Rate
Observation: Churn increases sharply for customers holding three or more products. The effect is most pronounced for older customers (Age > 50, Products ≥ 3: 42% vs. 12% for Age < 35 with a single product).
This interaction is detected by both models. FT-Transformer assigns a high mutual attention weight (0.21, as shown in Figure 5) to the pair {Age, NumOfProducts}, while XGBoost isolates the same pattern through recursive splits. Linear models cannot represent this effect due to additive constraints.
Additional Interactions:
- •Balance × IsActiveMember: High balance is protective for active members (8% churn) but not for inactive customers (23%).
- •Geography × Gender: Germany shows uniform churn across genders. France and Spain show a slight female bias, suggesting regional product-market sensitivity.
- •Tenure × Age: Younger customers with high tenure show the lowest churn (<5%), indicating that early retention programs may have a long-term impact.
#### VI-E4 Permutation Importance Validation
Permutation importance testing was conducted to ensure that the FT-Transformer relies on authentic predictive features rather than spurious correlations. This method quantifies the reduction in F1-score when each feature is independently shuffled, thereby disrupting any true association with the target variable while leaving other features unchanged. Figure [6](https://arxiv.org/html/2606.07582#S6.F6) presents the permutation importance results.
Figure 6: Permutation importance test illustrating F1-score reduction when each feature is randomly shuffled. Age (7.6%), IsActiveMember (7.2%), and NumOfProducts (6.4%) emerge as critical features. All features contribute substantially to model performance, suggesting a reliance on informative features.
Key Observations:
- •Age (7.6% drop): Strongest predictor. Permuting Age reduces F1 from 61.0% to 53.4%, confirming its central importance in churn prediction.
- •IsActiveMember (7.2% drop): Second most critical feature. Account activity status is essential for accurate churn forecasting.
- •NumOfProducts (6.4% drop): Third-ranked feature. Product holdings exhibit non-monotonic effects captured by the transformer.
- •Ranking aligns with SHAP: Top three features match SHAP importance from Section [VI-E](https://arxiv.org/html/2606.07582#S6.SS5), validating consistent feature attribution across methods.
- •No noise features: All features exhibit measurable F1 degradation \(0\.9\-7\.6%\), with no zero or negative impacts\. The model relies on genuine predictive signals\.
The permutation test confirms that the 4\-layer FT\-Transformer relies on genuine predictive signals rather than spurious patterns, despite the relatively small dataset size of 10,000 samples\.
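The permutation test itself is straightforward to sketch with NumPy. The example below uses a toy setup rather than the trained FT-Transformer: the label depends on one column only, the second column is pure noise, and a fixed thresholding rule stands in for the model. Function names and the number of repeats are assumptions.

```python
import numpy as np


def f1_score_binary(y_true, y_pred):
    """F1 for the positive class, computed from raw counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def permutation_importance_f1(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in F1 when each column is shuffled independently."""
    rng = np.random.default_rng(seed)
    base = f1_score_binary(y, predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to the target
            drops[j] += (base - f1_score_binary(y, predict(X_perm))) / n_repeats
    return base, drops


# Toy data: column 0 drives the label, column 1 is noise
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0.5).astype(int)
predict = lambda data: (data[:, 0] > 0.5).astype(int)   # stand-in for a trained model

base_f1, drops = permutation_importance_f1(predict, X, y)
```

In this toy case the noise column produces a drop of exactly zero, which is the signature the analysis above looks for (and does not find) among the real features.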
### VI-F ROC and Precision-Recall Curves
Fig. [7](https://arxiv.org/html/2606.07582#S6.F7) illustrates the Receiver Operating Characteristic (ROC) curves for the evaluated models.
Figure 7: ROC curves comparing the stacked ensemble with the MLP baseline.
The stacked ensemble achieves higher true positive rates (TPR) across false positive rates (FPR), resulting in a higher AUC (0.861) compared to the MLP baseline (0.834), indicating stronger discrimination capability.
Fig. [8](https://arxiv.org/html/2606.07582#S6.F8) presents the precision–recall (PR) curves for the evaluated models. The stacked ensemble performs better in the high-precision, low-recall region, a critical aspect for targeted churn intervention scenarios in which false positives (FPs) incur high costs. Quantitatively, the stacked ensemble achieves a PR-AUC of 0.647, compared to 0.593 for the MLP baseline, thereby confirming its superior performance in imbalanced classification.
Figure 8: Precision–recall curves comparing the stacked ensemble with the MLP baseline.
Key Observations:
- •ROC: AUC improves from 0.834 (MLP) to 0.861 (Stacked), confirming higher discriminative power.
- •PR: PR-AUC increases from 0.593 (MLP) to 0.647 (Stacked). Gains are concentrated in the high-precision region. This aligns with business needs where churn interventions have real financial cost.
### VI-G Probability Calibration Analysis
Reliable probability estimates are critical for decision-making in churn prediction. A classifier is considered calibrated when, for samples assigned a predicted probability $p$, roughly a proportion $p$ of them churn in practice. Calibration quality is assessed through reliability diagrams and the Expected Calibration Error (ECE). ECE is calculated by dividing the predicted probability interval $[0, 1]$ into $M$ discrete bins:
$$\text{ECE} = \sum_{m=1}^{M} \frac{|B_{m}|}{N} \left| \text{acc}(B_{m}) - \text{conf}(B_{m}) \right| \tag{28}$$
where $B_{m}$ denotes the samples assigned to bin $m$, $\text{acc}(B_{m})$ is the empirical accuracy for that bin, and $\text{conf}(B_{m})$ represents the mean predicted probability. Table [VIII](https://arxiv.org/html/2606.07582#S6.T8) reports calibration metrics across all models.
TABLE VIII: Probability Calibration Metrics
Figure [9](https://arxiv.org/html/2606.07582#S6.F9) presents reliability diagrams comparing calibration quality across models.
Figure 9: Reliability diagram showing predicted versus empirical probabilities. XGBoost (red) exhibits overconfidence in the high-probability region (> 0.6), where the shaded area indicates systematic deviation from the diagonal. The stacked ensemble (blue) successfully recalibrates these predictions, reducing ECE from 0.052 to 0.038.
The stacked ensemble achieves the lowest ECE (0.038) and the lowest maximum deviation (0.074). This indicates more reliable probability estimates than all baseline models, particularly compared to MLP and XGBoost, which exhibit overconfidence.
The calibration improvement is driven by the logistic meta\-learner\. It recalibrates the overconfident outputs of XGBoost and preserves the stable probability structure learned by FT\-Transformer\.
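The ECE definition in Eq. (28) can be sketched in NumPy as follows. Equal-width bins and a default of 10 bins are assumptions, since the binning scheme is not specified in this section; here the per-bin "accuracy" is taken as the observed churn rate, matching the calibration definition above.

```python
import numpy as np


def expected_calibration_error(y_true, p_pred, n_bins=10):
    """ECE with equal-width probability bins (Eq. 28)."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in [0, n_bins - 1]
    bin_idx = np.clip(np.digitize(p_pred, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for m in range(n_bins):
        in_bin = bin_idx == m
        if in_bin.any():
            acc = y_true[in_bin].mean()     # observed churn rate in the bin
            conf = p_pred[in_bin].mean()    # mean predicted probability in the bin
            ece += in_bin.mean() * abs(acc - conf)   # weight by |B_m| / N
    return ece


# Calibrated toy case: predicted 0.25, one in four churns
ece_calibrated = expected_calibration_error([1, 0, 0, 0], [0.25] * 4)

# Overconfident toy case: predicted 0.9, only half churn
ece_overconfident = expected_calibration_error([1] * 5 + [0] * 5, [0.9] * 10)
```

The overconfident case yields an ECE of 0.4, the gap between the 0.9 confidence and the 0.5 observed rate, which is the kind of deviation the reliability diagram visualizes per bin.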
### VI-H Inference Time Analysis
Table [IX](https://arxiv.org/html/2606.07582#S6.T9) reports average inference time per sample measured on CPU (single core) across the evaluated models.
TABLE IX: Inference Time per Sample (CPU, Single Core)
The stacked ensemble has higher latency than tree-based models because the FT-Transformer requires sequential attention computation across feature embeddings. The marginal overhead from the logistic meta-learner is negligible (<0.05 ms/sample). For latency-sensitive deployments that require inference below 1 ms, gradient boosting alternatives such as XGBoost or LightGBM are recommended. Model distillation or weight freezing can also reduce transformer inference costs with minimal impact on accuracy.
### VI-I Error Analysis
An examination of the 501 false negatives \(missed churners\) reveals consistent error patterns:
- •Age: 45% fall between ages 35\-45, indicating a bias toward expecting churn in older customers\.
- •Activity Status: 62% are active members, suggesting over\-reliance on inactivity as a churn signal\.
- •Balance: 38% maintain a zero balance, a boundary case that is weakly represented in training\.
- •Tenure: 28% have tenure > 8 years. Long-tenure churn is rare and difficult to infer from available features.
Failure Mode: The model struggles with "unexpected churners": middle-aged, active customers with moderate tenure whose departure may be driven by unobserved factors (e.g., competitor incentives or life events).
This motivates the ablation studies that follow, isolating the contributions of hybrid modeling components and assessing robustness to design decisions\.
### VI\-JIllustrative Case Studies
Three customer profiles illustrate model behavior across representative scenarios\. Interpretations reference aggregate SHAP values and attention weights from Section[VI\-E](https://arxiv.org/html/2606.07582#S6.SS5)\.
Case 1: High\-Confidence Correct Prediction \(True Positive\)
- •Profile: Age 52, Germany, Inactive, Balance $120K, 3 products
- •Predictions: FT\-Transformery^=0\.78\\hat\{y\}=0\.78; XGBoosty^=0\.82\\hat\{y\}=0\.82; Ensembley^=0\.81\\hat\{y\}=0\.81
- •Outcome: Churned
- •Interpretation: Four concurrent risk factors are present: older age \(SHAP = 0\.18\), inactive status \(SHAP = 0\.15\), Germany geography \(SHAP≈\+0\.08\\approx\+0\.08\), and 3\+ products\. The transformer assigns high attention to the Age×\\timesNumOfProducts interaction \(weight = 0\.21\), while XGBoost splits on the NumOfProducts≥3\\geq 3threshold\. Both models agree, producing a high\-confidence ensemble prediction\.
Case 2: Transformer Advantage on Edge Case \(FT Correct, XGBoost Incorrect\)
- •Profile: Age 41, France, Active, Balance $0, 2 products
- •Predictions: FT\-Transformery^=0\.62\\hat\{y\}=0\.62; XGBoosty^=0\.38\\hat\{y\}=0\.38; Ensembley^=0\.52\\hat\{y\}=0\.52
- •Outcome: Churned
- •Interpretation: Zero balance combined with active status represents an atypical combination \(3\.2% prevalence\)\. The transformer detects this through the IsActiveMember×\\timesBalance attention weight \(0\.21\)\. XGBoost underestimates the risk as its splits are calibrated for typical balance distributions\. The ensemble assigns greater weight to the transformer prediction in this region\.
Case 3: Missed Prediction — External Factors \(False Negative\)
- •Profile: Age 35, Spain, Active, Balance $85K, 1 product
- •Predictions: FT\-Transformer $\hat{y}=0.28$; XGBoost $\hat{y}=0.31$; Ensemble $\hat{y}=0.29$
- •Outcome: Churned
- •Interpretation: No dominant SHAP contributors are present\. The profile shows no measurable risk factors in the available feature set\. Churn was likely driven by external factors such as competitor offers or life events\. This failure mode accounts for 28% of false negatives \(Section [VI\-I](https://arxiv.org/html/2606.07582#S6.SS9)\) and motivates the inclusion of temporal behavioral signals in future work\.
### VI\-K Error Correlation and Synergy
The error correlation between FT\-Transformer and XGBoost is $\rho=0.62$, indicating partially independent failures that enable ensemble improvement\.
Complementarity analysis across all 10,000 predictions:
- •Both correct: 7,892 \(78\.9%\)
- •FT correct, XGB wrong: 534 \(5\.3%\)
- •XGB correct, FT wrong: 489 \(4\.9%\)
- •Both wrong: 1,085 \(10\.9%\)
The ensemble corrects 89\.1% of single\-model errors \(912 of the 1,023 cases in which exactly one base model is correct\)\. The 1,085 cases in which both models fail correspond to profiles where churn drivers are absent from the feature set, as illustrated in Case 3\.
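The counts above can be checked with a few lines of arithmetic. The sketch below only recomputes the reported shares and the recovery rate from the figures quoted in the text; the function and variable names are illustrative and do not come from the paper.

```python
# Per-sample outcome counts as reported in the complementarity analysis.
counts = {
    "both_correct": 7892,
    "ft_only": 534,     # FT-Transformer correct, XGBoost wrong
    "xgb_only": 489,    # XGBoost correct, FT-Transformer wrong
    "both_wrong": 1085,
}


def complementarity_summary(counts, ensemble_recovered):
    """Return totals, outcome shares, and the share of disagreements the ensemble resolves."""
    total = sum(counts.values())
    shares = {name: value / total for name, value in counts.items()}
    # "Single-model errors" are the cases where exactly one base model is right.
    single_model_errors = counts["ft_only"] + counts["xgb_only"]
    recovery_rate = ensemble_recovered / single_model_errors
    return total, shares, single_model_errors, recovery_rate


total, shares, single_model_errors, recovery_rate = complementarity_summary(counts, 912)
print(total, single_model_errors, round(100 * recovery_rate, 1))  # 10000 1023 89.1
```

The four categories sum to the full 10,000 predictions, and 912 of 1,023 disagreements gives the 89.1% figure.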
Meta\-learner weight stability: Across 5\-fold cross\-validation, learned coefficients show low variance: $w_{1}$ \(FT\-Transformer\) $=0.89\pm 0.04$, $w_{2}$ \(XGBoost\) $=0.78\pm 0.03$, and $w_{0}$ \(intercept\) $=-0.42\pm 0.05$\. The higher transformer weight reflects its superior standalone F1 \(61\.00% vs 58\.21%\), while the XGBoost weight \(0\.78\) confirms complementary value\.
## VII Ablation and Sensitivity Analysis
All ablation studies use 5\-fold cross\-validation with 5 random seeds \(42, 123, 456, 789, 1011\)\. Results are reported as mean ± standard deviation to match Table [V](https://arxiv.org/html/2606.07582#S6.T5)\.
### VII\-A Architectural Ablation Study
This study isolates the contribution of each model component\. Table [X](https://arxiv.org/html/2606.07582#S7.T10) reports model performance under the same settings\.
TABLE X: Ablation Study \- Component Contributions
Observations\. Under the same settings:
- •FT\-Transformer improves F1 by \+2\.27 over MLP \(58\.73 → 61\.00\)\.
- •Simple averaging \(FT\+XGB\) improves F1 by \+0\.45 over FT\-Transformer\.
- •Stacking achieves the highest scores: 62\.10 F1 and 0\.861 AUC\.
### VII\-B FT\-Transformer Architecture Variations
Table [XI](https://arxiv.org/html/2606.07582#S7.T11) examines the impact of FT\-Transformer architectural choices\.
TABLE XI: FT\-Transformer Architecture Ablation
Observations\.
- •Depth: 4 layers is optimal\. Deeper models \(6\-8 layers\) overfit on this dataset size\.
- •Embedding Dimension: 32 balances capacity and regularization\. 64\+ shows diminishing returns\.
- •Attention Heads: 8 heads provide sufficient multi\-head diversity\. Additional heads do not improve results\.
- •Dropout: 0\.1 is optimal\. Higher dropout \(0\.2\-0\.3\) over\-regularizes, 0\.0 overfits\.
This indicates that model capacity beyond the 4\-layer, 32\-dimensional, 8\-head configuration does not improve results under the same settings\.
A direct ablation comparing standard feature\-wise attention against intersample attention \(as in SAINT \[somepalli2021\]\) was not conducted in this study\. Intersample attention introduces row\-wise dependencies and a higher computational cost, which makes it less suitable for the latency constraints in Section [VI\-H](https://arxiv.org/html/2606.07582#S6.SS8)\. This comparison is identified as a direction for future work\.
### VII\-C Ensemble Component Analysis
The learned meta\-learner weights and model contributions are analyzed\.
Meta\-learner coefficients\.
$$\hat{p}_{stack}=\sigma\left(-0.42+0.89\cdot\hat{p}_{FT}+0.78\cdot\hat{p}_{XGB}\right)\tag{29}$$
Observations\.
- •FT\-Transformer receives a higher coefficient \(0\.89\) than XGBoost \(0\.78\)\.
- •The intercept \($-0.42$\) shifts the decision boundary to adjust for class imbalance\.
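Equation (29) can be evaluated directly. The following is a minimal sketch that applies the reported coefficients to base-model probabilities exactly as the equation is written; the function names and the example inputs are hypothetical and are not taken from the case studies.

```python
import math

# Coefficients as reported in Equation (29).
W0, W_FT, W_XGB = -0.42, 0.89, 0.78


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def stacked_probability(p_ft, p_xgb, w0=W0, w_ft=W_FT, w_xgb=W_XGB):
    """Logistic-regression meta-learner over two base-model probabilities."""
    for p in (p_ft, p_xgb):
        if not 0.0 <= p <= 1.0:
            raise ValueError("base-model outputs must be probabilities in [0, 1]")
    return sigmoid(w0 + w_ft * p_ft + w_xgb * p_xgb)


# Hypothetical inputs: both base models are undecided.
print(round(stacked_probability(0.5, 0.5), 3))
```

Both weights are positive, so the stacked output rises monotonically with either base-model probability, and the negative intercept lowers the output for any fixed pair of inputs.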
Error correlation\.
$$\rho=\text{Corr}\left(|y-\hat{p}_{FT}|,\ |y-\hat{p}_{XGB}|\right)=0.62\tag{30}$$
Interpretation\. A correlation of 0\.62 indicates that prediction errors are not fully aligned, leaving room for ensemble gains under the same settings\.
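Equation (30) is a Pearson correlation between the absolute errors of the two base models. A minimal sketch, assuming out-of-fold probabilities are available as plain Python lists (the helper names are illustrative):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant errors")
    return cov / math.sqrt(var_x * var_y)


def error_correlation(y_true, p_ft, p_xgb):
    """Correlation of |y - p_FT| and |y - p_XGB|, as in Equation (30)."""
    err_ft = [abs(y - p) for y, p in zip(y_true, p_ft)]
    err_xgb = [abs(y - p) for y, p in zip(y_true, p_xgb)]
    return pearson(err_ft, err_xgb)
```

A value of 1.0 would mean both models are wrong on the same samples by proportional amounts; lower values leave more disagreement for the meta-learner to exploit.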
Prediction complementarity\.
- •Both correct: 7,892 samples \(78\.9%\)
- •FT correct, XGB wrong: 534 samples \(5\.3%\)
- •XGB correct, FT wrong: 489 samples \(4\.9%\)
- •Both wrong: 1,085 samples \(10\.9%\)
The ensemble correctly classifies 89\.1% of the samples on which exactly one base model is correct \(912 of 1,023\)\.
This indicates that partial error independence contributes to the ensemble improvement under the same settings\.
### VII\-D Sensitivity to Class Weighting
Table [XII](https://arxiv.org/html/2606.07582#S7.T12) reports the effect of class\-weight ratios on the stacked ensemble under the same settings\.
TABLE XII: Sensitivity to Class\-Weight Ratio
Observations\.
- •F1 peaks at 3:1 \(62\.10%\), which is below the empirical class ratio of 4:1\.
- •Higher weights increase recall \(61\.34% → 82\.65%\) while reducing precision \(57\.82% → 46\.11%\)\.
- •AUC remains stable across ratios \(0\.852–0\.861\), indicating minimal impact on ranking performance\.
At 3:1, the model balances recall and precision under the same evaluation settings\.
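A class-weight ratio of 3:1 can be read as a positive-class weight of 3 in a binary cross-entropy loss. The sketch below is one common formulation and is an assumption here; the exact loss used by the base models is defined elsewhere in the paper and may differ in normalization.

```python
import math


def weighted_bce(y_true, p_pred, pos_weight=3.0, eps=1e-12):
    """Mean binary cross-entropy with the positive (churn) class up-weighted.

    pos_weight=3.0 corresponds to the 3:1 ratio discussed above; the
    per-sample mean is an assumed normalization.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# A missed churner (y=1, p=0.2) costs three times as much as under the unweighted loss.
print(weighted_bce([1], [0.2], pos_weight=3.0) / weighted_bce([1], [0.2], pos_weight=1.0))
```

Raising the weight penalizes false negatives more heavily, which is consistent with the reported pattern of recall rising and precision falling as the ratio increases.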
### VII\-E Threshold Tuning Analysis
Table [XIII](https://arxiv.org/html/2606.07582#S7.T13) analyzes how different thresholds influence the precision–recall trade\-off\.
TABLE XIII: Threshold Tuning Results
Business scenario guidance\.
- •High\-volume, low\-cost interventions \(email or automated outreach\): thresholds 0\.3–0\.4 maximize recall \(82–88%\) with lower precision\.
- •Balanced intervention cost: thresholds 0\.5–0\.6 yield the highest F1 range \(62\.10–63\.46%\)\.
- •High\-cost interventions \(discount incentives or agent\-assisted retention\): thresholds 0\.7–0\.8 prioritize precision \(70–79%\), focusing on high\-confidence churners\.
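The trade-off in the guidance above can be reproduced on any set of scored customers by sweeping the decision threshold. A minimal sketch with hypothetical labels and probabilities (the data below are not from the paper):

```python
def precision_recall_f1(y_true, p_pred, threshold):
    """Precision, recall, and F1 when predicting churn for p >= threshold."""
    tp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, p_pred) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def sweep_thresholds(y_true, p_pred, thresholds):
    """Map each threshold to its (precision, recall, F1) triple."""
    return {t: precision_recall_f1(y_true, p_pred, t) for t in thresholds}


# Hypothetical toy data: 4 churners among 8 customers.
y = [1, 1, 1, 0, 0, 0, 0, 1]
p = [0.9, 0.75, 0.45, 0.6, 0.35, 0.2, 0.1, 0.32]
for t, (prec, rec, f1) in sweep_thresholds(y, p, [0.3, 0.5, 0.7]).items():
    print(f"tau={t:.1f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

On this toy data a low threshold captures every churner at reduced precision, while a high threshold flags only high-confidence cases, mirroring the low-cost and high-cost intervention scenarios.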
### VII\-F Training Data Size Sensitivity
Performance across varying training data sizes has been evaluated \(Table [XIV](https://arxiv.org/html/2606.07582#S7.T14)\)\.
TABLE XIV: Performance vs Training Data Size
Observations\.
- •XGBoost is the most data\-efficient model, achieving 96% of its full\-data performance with only 25% of the training data\.
- •FT\-Transformer benefits more from additional data, improving by 6\.88 F1 points from 25% to 100%\.
- •The stacked ensemble consistently outperforms individual models across all data sizes\.
### VII\-G Cross\-Validation Stability
Variance is assessed across random seeds and fold configurations \(Table [XV](https://arxiv.org/html/2606.07582#S7.T15)\)\.
TABLE XV: Cross\-Validation \(CV\) Stability Analysis
Observations\.
- •Increasing folds reduces variance: 5\-fold CV \(1 seed\) lowers std to 1\.12 compared to 3\-fold \(1\.45\)\.
- •Multiple seeds further stabilize results: 5\-fold with 5 seeds reduces std to 0\.72\.
- •10\-fold CV offers similar stability \(0\.89 std\) but with higher computational cost\.
5\-fold CV with 5 seeds is used in subsequent experiments as a stability–efficiency trade\-off under the same settings\.
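The 5-fold, 5-seed protocol yields 25 scores per model. The sketch below shows one way to generate such splits and aggregate the scores. Stratified fold assignment is an assumption here, and `score_fn` is a hypothetical callback that would train on `train_idx` and return a metric on `test_idx`.

```python
import random
import statistics

SEEDS = (42, 123, 456, 789, 1011)  # seeds listed at the start of Section VII


def stratified_folds(labels, n_folds, seed):
    """Assign sample indices to folds so each fold keeps the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


def repeated_cv_scores(labels, score_fn, n_folds=5, seeds=SEEDS):
    """Collect one score per (seed, fold) pair: 25 scores for 5 folds x 5 seeds."""
    scores = []
    for seed in seeds:
        folds = stratified_folds(labels, n_folds, seed)
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            scores.append(score_fn(train_idx, test_idx))
    return scores


def summarize(scores):
    """Mean and sample standard deviation, the format used in the tables."""
    return statistics.mean(scores), statistics.stdev(scores)
```

Repeating the split under several seeds averages over partition noise, which is the mechanism behind the lower standard deviation reported for the multi-seed configuration.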
### VII\-H Meta\-Learner Selection Analysis
Table [XVI](https://arxiv.org/html/2606.07582#S7.T16) compares stacking strategies under the same settings\.
TABLE XVI: Meta\-Learner Comparison
Observations\.
- •Simple averaging improves F1 to 61\.45, indicating baseline complementarity\.
- •Weighted averaging reaches 61\.87 \($w_{FT}=0.6$, $w_{XGB}=0.4$\) after grid search, close to the best result\.
- •Logistic regression yields the highest F1 \(62\.10\), with learned weights \($w_{FT}=0.89$, $w_{XGB}=0.78$\)\.
- •Ridge and Lasso provide similar results \(62\.08 / 61\.94\), suggesting regularization has limited effect with two meta\-features\.
- •Gradient Boosting increases recall to 75\.67 but lowers precision to 52\.78, producing 61\.98 F1\.
- •A neural network meta\-learner underperforms \(61\.73 F1\), consistent with overfitting risk when only two inputs are available\.
Conclusion\. Under identical settings, logistic regression achieves the most favorable precision–recall balance\. With only two base models \($M=2$\), more complex meta\-learners may overfit and do not improve performance\.
Ablation results indicate that both architecture choices and the stacking strategy influence performance\. These findings guide the interpretation of subsequent results and the discussion of practical and research implications\.
## VIII Discussion
### VIII\-A Interpretation of Performance Gains
TABLE XVII: Performance Gains and Contributing Factors
Table [XVII](https://arxiv.org/html/2606.07582#S8.T17) summarizes the performance improvements and underlying mechanisms\. The following sections provide a detailed interpretation\.
FT\-Transformer vs MLP\. The FT\-Transformer demonstrates superior performance through three primary mechanisms\. First, dynamic feature weighting enables context\-dependent attention, with IsActiveMember receiving higher attention for high\-balance customers\. Second, the architecture captures higher\-order interactions such as $\textit{Age}\times\textit{NumOfProducts}$ \(attention weight 0\.18\), which aligns with observed churn patterns \(42% vs 12%\)\. Third, the unified feature tokenization supports cross\-type interactions \(e\.g\., $\textit{Geography}\times\textit{Balance}$\) without requiring domain\-specific layers\.
Ensemble Improvements\. The stacking ensemble achieves gains through complementary modeling strategies\. FT\-Transformer captures continuous behavioral signals, whereas XGBoost models discrete threshold effects\. The error correlation of $\rho=0.62$ indicates partially independent failure modes\. This error diversity allows the meta\-learner to recover 89\.1% of single\-model errors\. The learned stacking weights \($w_{FT}=0.89$, $w_{XGB}=0.78$\) reflect the relative contributions of each base model to the final prediction\.
### VIII\-B Business Implications and ROI Analysis
To assess practical deployment value, this analysis considers a mid\-sized bank with 100,000 customers, 20% annual churn rate, $2,000 customer lifetime value, and $50 intervention cost per customer\.
Operating Point \($\tau=0.5$\)\. At the default threshold, the model achieves 75\.4% recall and 53\.2% precision, flagging 28,346 customers and correctly identifying 15,080 actual churners\. The total intervention cost is $1\.42M, and the retained customer lifetime value is $7\.54M \(15,080 × $2,000 × an implied retention success rate of 25%\), yielding an estimated financial benefit of approximately $6\.12M annually\.
Comparison with MLP baseline\. The ensemble produces financial returns comparable to the MLP baseline \($6\.12M vs $6\.14M\) while reducing operational overhead by 14%, contacting 28,346 customers instead of 32,940\. This efficiency gain preserves Return on Investment \(ROI\) and reduces both customer contact fatigue and campaign costs\.
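The operating-point figures can be reproduced with simple arithmetic. In the sketch below, the 25% intervention success rate (`save_rate`) is an assumption inferred from the reported numbers, since $7.54M equals 15,080 × $2,000 × 0.25; the function and argument names are illustrative.

```python
def campaign_economics(n_customers, churn_rate, recall, precision,
                       clv, cost_per_contact, save_rate):
    """Cost and benefit of a retention campaign at one operating point."""
    churners = n_customers * churn_rate
    caught = recall * churners        # true positives
    contacted = caught / precision    # every flagged customer is contacted
    cost = contacted * cost_per_contact
    retained_value = caught * save_rate * clv
    return {
        "caught": caught,
        "contacted": contacted,
        "cost": cost,
        "retained_value": retained_value,
        "net_benefit": retained_value - cost,
    }


# save_rate=0.25 is inferred, not stated in this section.
result = campaign_economics(100_000, 0.20, 0.754, 0.532, 2_000, 50, save_rate=0.25)
print({k: round(v) for k, v in result.items()})

# Relative reduction in contacted customers versus the MLP baseline (32,940).
print(round(100 * (32_940 - 28_346) / 32_940))  # 14
```

Under these inputs the sketch returns roughly 28,346 contacts, $1.42M in cost, $7.54M in retained value, and $6.12M in net benefit, matching the figures above.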
Threshold selection\. The framework supports flexible threshold tuning to match business objectives\. Low thresholds \($\tau=0.3$\) maximize coverage with 88% recall, suitable for low\-cost retention campaigns\. Balanced thresholds \($\tau=0.4$–$0.5$\) optimize net financial benefit across standard cost profiles\. High thresholds \($\tau\geq 0.7$\) achieve 70% precision for targeted premium customer outreach, minimizing false positives at the cost of lower coverage\.
Threshold choice depends on the CLV\-to\-intervention\-cost ratio and organizational tolerance for false positives\. Table[XIII](https://arxiv.org/html/2606.07582#S7.T13)offers a calibration template for cost\-aligned threshold tuning\[niculescu2005predicting\]\.
### VIII\-C Theoretical Implications for Tabular Deep Learning
The performance gains observed in this study provide empirical support for theoretical principles underlying hybrid model design on tabular data\.
Complementary inductive biases\. Tree\-based models partition the feature space through axis\-aligned splits, resulting in sharp decision boundaries suitable for categorical variables and threshold effects\. Transformers model continuous feature interactions via attention mechanisms, enabling effective representation of smooth gradients in numerical features\. This hybrid approach addresses limitations identified in previous research\. Grinsztajn et al\. \[grinsztajn2022\] reported that neural networks often struggle with the irregular decision boundaries common in tabular data\. Formally, axis\-aligned splits take the form $x_{j}\leq t$, producing piecewise constant boundaries, whereas the attention matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$ captures global feature dependencies simultaneously\. These two mechanisms are structurally complementary rather than redundant\.
Error diversity and ensemble theory\. The observed error correlation of 0\.62 aligns with ensemble learning theory \[zhou2012\]\. A perfect correlation of 1\.0 yields no benefit from model combination\. A correlation of zero is unattainable with shared training data\. The moderate correlation observed here indicates that FT\-Transformer and XGBoost make errors in partially independent regions of the feature space\. This diversity of errors allows the meta\-learner to recover 89\.1% of single\-model failures\. From Equation \(24\), with $M=2$ models, equal variance $\sigma^{2}$, and observed correlation $\rho=0.62$, the ensemble variance reduces to $0.81\sigma^{2}$\. This represents a 19% reduction relative to any single model, consistent with the empirical F1 gain of 1\.10 points observed over the stronger base model\.
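The variance figure follows from the standard result for an equal-weight average of $M$ predictors with common variance $\sigma^{2}$ and pairwise correlation $\rho$, namely $\sigma^{2}/M + \frac{M-1}{M}\rho\sigma^{2}$. Equation (24) is not reproduced in this section, so the sketch assumes it takes this standard form.

```python
def ensemble_variance_factor(n_models, rho):
    """Variance of an equal-weight average as a multiple of sigma^2.

    Assumes every model has the same variance and every pair of models
    has the same correlation rho.
    """
    if n_models < 1:
        raise ValueError("n_models must be at least 1")
    return 1.0 / n_models + (n_models - 1) / n_models * rho


factor = ensemble_variance_factor(2, 0.62)
print(round(factor, 2), round(100 * (1 - factor)))  # 0.81 19
```

With $\rho=1$ the factor is 1 (no benefit) and with $\rho=0$ it is $1/M$, so the observed 0.62 sits between the two extremes.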
Calibration through stacking\. The reduction in Expected Calibration Error from 0\.051 \(XGBoost\) to 0\.038 \(ensemble\) suggests that meta\-learning can mitigate probability overconfidence without the need for separate calibration datasets\. This result extends prior work on stacking \[wolpert1992\] by showing that calibration improvement arises directly from the meta\-learner’s training objective during cross\-validation\.
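Expected Calibration Error can be estimated by binning predicted probabilities and comparing each bin's mean prediction with its observed churn rate. The sketch below uses equal-width bins. The bin count of 10 is an assumption and may differ from the paper's setting.

```python
def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Weighted mean gap between predicted probability and observed frequency."""
    n = len(y_true)
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[k].append((y, p))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for _, p in members) / len(members)
        observed = sum(y for y, _ in members) / len(members)
        ece += len(members) / n * abs(observed - mean_pred)
    return ece


# Overconfident toy model: predicts 0.9 but only half of those customers churn.
print(round(expected_calibration_error([1, 0, 1, 0], [0.9, 0.9, 0.9, 0.9]), 2))  # 0.4
```

A perfectly calibrated model scores 0; the reported drop from 0.051 to 0.038 means the ensemble's probabilities track observed churn rates more closely.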
### VIII\-D Positioning Against Prior Work
Table [XVIII](https://arxiv.org/html/2606.07582#S8.T18) compares the results of the present work with published benchmarks\.
TABLE XVIII: Comparison with Published Results

| Study | Method | AUC | F1 |
| --- | --- | --- | --- |
| Xu et al\. \[xu2021\] | XGBoost\+Stacking | 0\.89\* | \- |
| Burez \[burez2009\] | Random Forest | 0\.82\* | \- |
| Gorishniy \[gorishniy2021\] | FT\-Transformer | 0\.85\*\* | \- |
| Ahmad et al\. \[ahmad2023\] | Hybrid Stacking | 0\.986\*\*\* | \- |
| Usman\-Hamza et al\. \[usmanhamza2024\] | Multi\-layer Stacking | 0\.989\*\*\* | 97\.2%\*\*\* |
| Warnakulaarachchi et al\. \[warnakulaarachchi2025\] | Deep Ensemble | \- | 87\.95%† |
| This Work | FT\-Trans\+XGBoost | 0\.861 | 62\.1% |

\*Different dataset; \*\*Average across multiple datasets; \*\*\*SMOTE\-dependent; †Accuracy only\.

Direct comparison with prior studies \(Table [XVIII](https://arxiv.org/html/2606.07582#S8.T18)\) is limited by dataset heterogeneity and metric choice\. Xu et al\. report 98% accuracy on telecommunications data \[xu2021\], but accuracy is not informative under imbalance\. A trivial “no churn” classifier reaches 80% on this dataset without any intervention value\. In contrast, the proposed model reaches 62\.1% F1 and 53\.2% precision, which better reflects retention\-quality \[burez2009,verbeke2012new\]\.
The achieved AUC of 0\.861 is comparable to Xu et al\. \(0\.89 on a different dataset\) and to FT\-Transformer reports across domains \[gorishniy2021\]\. However, ranking metrics alone do not reflect operational trade\-offs\. This work differs from prior modeling pipelines by integrating calibration\-aware stacking with FT\-Transformer and XGBoost\. The hybrid leverages complementary inductive biases where attention captures continuous interactions and trees capture discrete boundaries \[grinsztajn2022,shwartz2022tabular\]\. Ablation studies \(Tables [X](https://arxiv.org/html/2606.07582#S7.T10)–[XVI](https://arxiv.org/html/2606.07582#S7.T16)\) support this contribution under controlled settings\.
Prior stacking work in biomedical and explainable modeling \[yang2025\] does not address calibration for decision reliability\. This study incorporates effect\-size reporting, confidence intervals, and seed\-averaged evaluation for statistical robustness \[guo2017calibration,naeini2015obtaining\]\. The six\-metric evaluation framework enables threshold\-dependent precision\-recall trade\-offs, which are essential for intervention planning and address the limitations of ranking metrics alone\.
Recent same\-domain studies report stronger aggregate metrics\. Ahmad et al\. \[ahmad2023\] achieve AUC 0\.986, and Usman\-Hamza et al\. \[usmanhamza2024\] report F1 0\.972 on telecom churn data\. However, both rely on SMOTE\-based oversampling, which introduces synthetic artifacts into minority class distributions\. Warnakulaarachchi et al\. \[warnakulaarachchi2025\] report 87\.95% accuracy on banking churn but do not report F1 or calibration metrics, limiting direct comparison\. Dataset heterogeneity and metric choices further constrain cross\-study comparisons\. This work prioritizes calibration\-aware evaluation, statistical validation, and reproducibility over raw metric maximization\.
## IX Limitations and Threats to Validity
The ensemble combines two complex models, limiting interpretability\.
### IX\-A External Validity
This study is based on a single dataset \(European banking, 2010–2015\)\. Generalization to telecommunications, e\-commerce, and insurance is not guaranteed, as churn mechanisms differ \(contract expiry vs subscription lapse vs account inactivity\)\. Organizations should validate the framework on proprietary datasets with domain\-specific features \[verbeke2012new,buckinx2005customer\]\.
### IX\-B Feature and Temporal Scope
Only static, cross\-sectional features were used\. Temporal signals that indicate behavioral drift \(declining balance, login decay, reduced product usage\) are not captured\. Rolling window statistics \(e\.g\., Balance\_30d\_avg, Balance\_trend\) are recommended for production deployments\. Fairness evaluation across demographic groups such as age, gender, and geography was not conducted in this study and is included as a priority future research direction \(Section [XI](https://arxiv.org/html/2606.07582#S11)\)\.
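The rolling-window statistics named above could be derived from a per-customer daily balance series. The sketch below treats `Balance_30d_avg` as the mean over the last 30 days and `Balance_trend` as the least-squares slope over the same window. Both definitions are assumptions, since the text gives only the feature names.

```python
def rolling_balance_features(daily_balances, window=30):
    """Return (mean balance, slope per day) over the most recent `window` days."""
    recent = daily_balances[-window:]
    n = len(recent)
    if n < 2:
        raise ValueError("need at least two observations to estimate a trend")
    mean_balance = sum(recent) / n
    mean_t = (n - 1) / 2
    # Ordinary least-squares slope of balance against day index.
    cov = sum((t - mean_t) * (b - mean_balance) for t, b in enumerate(recent))
    var_t = sum((t - mean_t) ** 2 for t in range(n))
    return mean_balance, cov / var_t


# Hypothetical customer whose balance falls by 10 per day.
balances = [1000 - 10 * day for day in range(60)]
balance_30d_avg, balance_trend = rolling_balance_features(balances)
print(balance_30d_avg, balance_trend)
```

A persistently negative slope would flag the declining-balance pattern that static, cross-sectional features cannot express.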
### IX\-C Evaluation and Generalization
This study uses 5\-fold cross\-validation across 5 seeds \(25 evaluations\) without a held\-out test set\. The choice of 5\-fold over 10\-fold is supported by the stability analysis in Table [XV](https://arxiv.org/html/2606.07582#S7.T15)\. This shows that multiple seeds reduce variance more effectively than increasing fold count alone\. While this approach enhances robustness, it may also introduce optimism \[grinsztajn2022\]\.
- •Hyperparameter Selection Bias: settings in Table [IV](https://arxiv.org/html/2606.07582#S5.T4) were chosen from cross\-validation results instead of a separate development set\.
- •Architecture Choices: 4 layers, 8 attention heads, and embedding dimension 32 were selected from preliminary runs on the same dataset used for evaluation\.
- •Threshold Calibration: the operating point \($\tau=0.5$–$0.6$\) was chosen from validation folds and not an independent calibration set \[niculescu2005predicting\]\.
These factors may lead to higher reported performance than would be expected on unseen data\.
### IX\-D Expected Production Performance
Based on established machine learning best practices and the cross\-validation results, the following degradation should be anticipated in production:
- •Conservative Estimate: Production performance may degrade by 1\-2 percentage points in F1\-score when deployed on truly unseen data from the same distribution \[grinsztajn2022\] \(i\.e\., F1 $\approx$ 60\-61% rather than 62\.1%\)\.
- •Domain Shift: A 3\-5 point reduction may occur when the model is applied to new regions, time periods, or customer segments not represented in the training data \[verbeke2012new\]\.
- •Concept Drift: Further degradation over time if customer behavior changes, requiring periodic monitoring and retraining to maintain stability\[verbeke2012new\]\.
### IX\-E Supporting Factors for Reported Results
The following factors reduce the likelihood that reported performance reflects evaluation artifacts rather than model capability:
- •Stable results across 25 splits \(F1 std = 0\.72\) suggest low sensitivity to data partitions\.
- •Architectural choices \(self\-attention for interactions, GBM for boundaries\) follow domain structure rather than trial\-and\-error tuning\.
- •Ablation studies \(Section[VII](https://arxiv.org/html/2606.07582#S7)\) indicate that performance gains stem from model design components, not hyperparameter configurations\.
- •Large effect sizes in statistical testing \(Cohen’s d = 0\.9–2\.8\) support the robustness of improvements\.
### IX\-F Recommendations for Practitioners
- •Before deployment: Hold out 10\-20% of data for testing\. Train on the remaining portion and evaluate once on the hold\-out set for an unbiased estimate \[grinsztajn2022\]\.
- •During deployment: Run A/B tests against existing churn systems or heuristics\. Monitor retention outcomes for contacted customers \[niculescu2005predicting\]\.
- •Post\-deployment: Track monthly performance\. Retrain if F1 drops by more than 2 points or if concept drift is detected in feature distributions \[verbeke2012new\]\.
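The post-deployment rule can be turned into a simple monitoring check. The sketch below pairs the 2-point F1 trigger from the text with a Population Stability Index (PSI) as one possible drift measure. PSI is a substitute chosen here for illustration, since the paper does not specify a drift detector.

```python
import math


def f1_drop_exceeded(baseline_f1, current_f1, max_drop=2.0):
    """True when F1 (in percentage points) has fallen by more than max_drop."""
    return (baseline_f1 - current_f1) > max_drop


def population_stability_index(expected_share, actual_share, eps=1e-6):
    """PSI between binned feature shares at training time and in production."""
    psi = 0.0
    for e, a in zip(expected_share, actual_share):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty bins
        psi += (a - e) * math.log(a / e)
    return psi


# Hypothetical monthly check against the cross-validated baseline of 62.1% F1.
print(f1_drop_exceeded(62.1, 59.9))                                           # True
print(round(population_stability_index([0.5, 0.3, 0.2], [0.4, 0.3, 0.3]), 3))
```

PSI is zero when the production distribution matches the training distribution and grows as the two diverge; the alert threshold is a deployment decision that this sketch leaves open.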
### IX\-G Interpreting Reported Metrics
The reported F1 of 62\.10% \(95% CI \[61\.65, 62\.55\]\), AUC of 0\.861 \(95% CI \[0\.858, 0\.864\]\), and PR\-AUC of 0\.647 represent performance estimates under the current validation protocol\. In production, expected performance is typically lower due to unseen data and operational variance\. A realistic target is 60\-61% F1, which remains competitive for imbalanced churn prediction\[burez2009,verbeke2012new\]\.
### IX\-H Interpretability Trade\-offs
Ensemble explanations require reconciling FT\-Transformer attention weights with XGBoost SHAP values\. Both methods consistently surface Age and IsActiveMember as dominant predictors \(Section [VI\-E](https://arxiv.org/html/2606.07582#S6.SS5)\)\. SHAP values can serve as a practical proxy for communicating ensemble behavior to stakeholders \[grinsztajn2022\]\.
### IX\-I Applicability Boundaries
This framework is not universally optimal\. Table [XIX](https://arxiv.org/html/2606.07582#S9.T19) summarizes conditions where alternative methods may be preferable\.
TABLE XIX: Applicability Boundaries and Alternatives
## X Conclusion
This study demonstrates that systematic integration of transformer\-based feature learning with gradient\-boosted decision trees achieves strong performance in tabular churn prediction\. The proposed hybrid architecture addresses a key limitation in previous research by combining complementary inductive biases\. Attention mechanisms effectively capture continuous feature interactions, while tree\-based partitioning manages discrete decision boundaries\. Rigorous statistical validation across 25 repeated cross\-validation evaluations confirms that these performance gains are robust and reproducible\. Probability calibration analysis further shows that the framework produces reliable predictions suitable for cost\-sensitive business decisions\.
The empirical findings yield several key insights into hybrid modeling for tabular data\. First, the observed error correlation of 0\.62 between FT\-Transformer and XGBoost demonstrates that models with different inductive biases generate partially independent errors, which enable meaningful ensemble gains\. The meta\-learner successfully recovers 89\.1% of cases where one base model was correct and the other failed\. Second, probability calibration occurred naturally during the stacking process, eliminating the need for separate post\-processing\. The reduction in Expected Calibration Error \(ECE\) from 0\.051 to 0\.038 indicates that the logistic meta\-learner mitigates the overconfidence present in gradient\-boosted predictions\. Third, ablation studies confirm that both the transformer component and the stacking strategy are essential to performance, with neither redundant\.
The proposed framework demonstrates readiness for practical deployment by delivering measurable business value, such as improved retention and reduced intervention overhead\. While the evaluation uses banking data, the methodology remains domain\-agnostic and potentially transferable to telecommunications, insurance, e\-commerce, and subscription services, provided appropriate feature alignment\. Comprehensive algorithmic specifications and hyperparameter configurations support reproducibility and enable independent validation\.
## XI Future Research Directions
Priority 1 \(Immediate\)\.
- •Multi\-domain validation across telecom, e\-commerce, and insurance benchmarks to assess transferability where churn mechanisms differ substantially\.
- •Temporal signals \(rolling\-window statistics\) for behavioral trajectory modeling\.
Priority 2 \(Model Capabilities\)\.
- •Extend ensemble to $M>2$ models to study diversity–accuracy trade\-offs\.
- •Cost\-sensitive learning optimizing CLV\-adjusted retention benefit directly\.
- •Row\-wise attention \(SAINT\-style\) for rare churn precursor recognition\.
- •Employ adaptive embedding allocation based on categorical cardinality and mutual information with the target variable to enhance efficiency on heterogeneous feature sets\.
Priority 3 \(Production Readiness\)\.
- •Fairness audit to test for demographic performance gaps\.
- •Online learning pipelines for concept drift adaptation\.
- •Model compression or distillation for $<10$ ms inference\.
## Appendix A Notation and Symbol Conventions
The mathematical symbols used are summarized in Table [XX](https://arxiv.org/html/2606.07582#A1.T20):
TABLE XX: Mathematical Notation Conventions
Notational Conventions:
- •Scalars: Italic lowercase or Greek letters \(e\.g\., $n$, $\lambda$, $\tau$\)
- •Vectors: Bold lowercase \(e\.g\., $\mathbf{w}_{j}$, $\mathbf{x}^{(i)}$, $\mathbf{e}_{j}$\)
- •Matrices: Bold uppercase \(e\.g\., $\mathbf{W}^{Q}$, $\mathbf{E}$, $\mathbf{Z}$\)
- •Sample index: Superscripts in parentheses denote sample index \(e\.g\., $\mathbf{x}^{(i)}$\)
- •Feature index: Subscripts denote feature or component index \(e\.g\., $x_{j}$, $\mathbf{w}_{j}$\)
## Appendix B Evaluation Metrics
Metrics used to evaluate model performance under class imbalance:
- •Recall \(Sensitivity\): $\frac{TP}{TP+FN}$, proportion of actual churners correctly identified\.
- •Precision: $\frac{TP}{TP+FP}$, proportion of predicted churners who actually churned\.
- •F1\-Score: $2\cdot\frac{Precision\cdot Recall}{Precision+Recall}$, harmonic mean of precision and recall\.
- •Specificity: $\frac{TN}{TN+FP}$, proportion of non\-churners correctly identified\.
- •AUC\-ROC: Area under ROC curve, probability that a random positive is ranked above a random negative\.
- •NPV \(Negative Predictive Value\): $\frac{TN}{TN+FN}$, proportion of predicted non\-churners confirmed as retained\.
Mean $\pm$ standard deviation is reported across 5 folds $\times$ 5 seeds\.
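The threshold-dependent metrics above follow directly from confusion-matrix counts. A minimal sketch with hypothetical counts for 1,000 customers at a 20% churn rate (the numbers are illustrative, not results from the paper):

```python
def imbalance_metrics(tp, fp, tn, fn):
    """Recall, precision, F1, specificity, and NPV from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "recall": recall,
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical counts: 200 churners, 800 non-churners.
metrics = imbalance_metrics(tp=150, fp=130, tn=670, fn=50)
print({name: round(value, 3) for name, value in metrics.items()})
```

AUC-ROC is omitted because it is computed from the ranking of predicted probabilities rather than from counts at a single threshold.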
## Acknowledgment
The authors thank the Kaggle community for providing public access to the Bank Customer Churn dataset\. This research received no specific grant from any funding agency\.
## References
Joyjit Roy is a senior technology and program management leader with over 21 years of experience in enterprise digital transformation, cloud modernization, and applied artificial intelligence initiatives\. He currently serves in a principal\-level leadership role, leading large\-scale modernization programs that integrate machine learning, intelligent automation, and Agile delivery frameworks across global enterprises\.
His technical and research interests include applied AI and machine learning, natural language processing, multimodal systems, computer vision, agentic AI, edge AI, cybersecurity automation, and intelligent workflow orchestration\. His work focuses on translating emerging AI capabilities into scalable enterprise solutions that improve decision\-making, operational efficiency, and organizational resilience\.
Mr\. Roy is a Senior Member of IEEE, a Fellow of the British Computer Society \(FBCS\), and a Fellow of the Association for Project Management \(FAPM\)\. He is an active contributor to the professional and academic community through peer review, technical judging, research publications, and speaking engagements\. His professional interests include enterprise AI adoption, governance\-driven automation, and technology leadership\.
Samaresh Kumar Singh is a Principal Engineer with over 21 years of experience architecting large\-scale distributed systems across edge computing, IoT/IIoT, agentic AI, cloud platforms, and cybersecurity\. He specializes in resilient, low\-latency architectures for deploying and operating AI/ML workloads across heterogeneous hardware in cloud\-edge environments\. He has led major platform modernization initiatives, developed distributed compute systems, and driven substantial gains in scalability, reliability, and performance across production infrastructure\.
Mr\. Singh holds a master’s degree in Computer Software Engineering from Colorado Technical University, USA, and a bachelor’s degree in Computer Engineering from the Institute of Engineering and Management, India\. He has contributed to widely adopted open\-source initiatives spanning AI/ML frameworks, large language model inference and serving platforms, computer vision and scientific computing libraries, distributed systems, observability infrastructure, performance\-critical systems software, and actively engages with technical communities through code contributions, design reviews, and mentorship\.
His research interests include distributed edge intelligence, trustworthy AI, IDS/IPS systems, hardware\-aware model deployment, and performance and energy\-aware orchestration at scale\. He is a Senior Member of IEEE, an active technical reviewer, and a mentor in the global engineering community\.
Laxmi Shaw is a Lead Data Scientist at a multinational financial organization in the United States\. From 2023 to 2025, she was a Postdoctoral Scholar in the Department of Information Systems and Analytics at Texas State University and a research collaborator at The University of Texas at Austin, Dell Medical School, where her work focused on adversarial machine learning, large language models, healthcare fraud analytics, Gen AI, computer vision, agentic AI and predictive biomarker modeling using high\-performance computing\. She has over fifteen years of combined research and industry experience, including roles at Samsung Research and Carrier Corporation, with expertise spanning AI\-driven analytics, IoT systems, and digital\-twin modeling\.
Dr\. Shaw received the Ph\.D\. degree in electrical engineering with a specialization in artificial intelligence and machine learning from IIT Kharagpur, India\. She also holds the M\.Tech\. degree from Jadavpur University and the B\.E\. degree from Sambalpur University\. She is the author of six books and has published over 42 peer\-reviewed papers in journals, conferences, and book chapters\.
Her research interests include AI/ML security, EEG signal processing, adversarial robustness in LLMs, Siamese networks, and GPU\-accelerated healthcare analytics\. She is a Senior Member of IEEE, an active reviewer for several IEEE, Springer Nature and Elsevier venues, and a recipient of awards including IEEE Best Paper Awards and the ACM\-W Women in Smart Computing Award\. Since January 2025, she has been serving as Adjunct Faculty and Research Scientist at Texas A&M University–Victoria\.