Federated Survival Analysis in Healthcare: A Multi-Model Evaluation on Cross-Institutional Heterogeneous Breast Cancer Data

arXiv cs.LG Papers

Summary

This paper systematically evaluates three survival models (Cox, DeepSurv, RSF) under federated learning on heterogeneous breast cancer data, finding that FL outperforms local training and RSF offers the best balance of performance across clients.

arXiv:2606.23871v1 Announce Type: new

Cached at: 06/24/26, 07:49 AM

# Federated Survival Analysis in Healthcare: A Multi-Model Evaluation on Cross-Institutional Heterogeneous Breast Cancer Data
Source: [https://arxiv.org/html/2606.23871](https://arxiv.org/html/2606.23871)
Anusha Ihalapathirana, Pekka Siirtola, Miguel Fernandez\-de\-Retana

This work was supported by the Basque Government under grant DEUSTEK6 – Humanized Computing for Smart Sustainable and Healthier Communities and Environments \(IT1901\-26\), and by the European Union’s Horizon Europe Research and Innovation Programme under the LATE\-AYA project \(Grant Agreement No\. 101214326\)\. N\. Moreno\-Blasco is with the Faculty of Information Technology and Electrical Engineering, University of Oulu, Oulu, Finland \(e\-mail: natalia\.moreno@student\.oulu\.fi\)\. A\. Ihalapathirana and P\. Siirtola are with the Biomimetics and Intelligent Systems Group, University of Oulu, Oulu, Finland \(e\-mails: \{anusha\.ihalapathirana, pekka\.siirtola\}@oulu\.fi\)\. M\. Fernandez\-de\-Retana is with the Faculty of Engineering, University of Deusto, Bilbao, Spain \(e\-mail: m\.fernandezderetana@deusto\.es\)\.

###### Abstract

Survival analysis is central to clinical decision\-making, yet reliable time\-to\-event models require large, diverse cohorts that are rarely available at a single institution, while privacy regulations restrict the centralization of patient data\. Federated learning \(FL\) offers a privacy\-preserving alternative by training shared models without exchanging raw data, but its effectiveness for survival modeling under realistic, heterogeneous conditions remains insufficiently understood\. This paper presents a systematic, multi\-model evaluation of federated survival analysis on a cross\-institutional breast cancer cohort with naturally heterogeneous distributed clients\. Three representative survival models, the Cox Proportional Hazards model, DeepSurv, and Random Survival Forest \(RSF\), are compared across centralized, local, and federated training, and three federated optimization strategies \(FedAvg, FedProx, and FedAdam\) are assessed for the gradient\-based models\. Results show that FL consistently outperforms local training and approaches, and occasionally exceeds, centralized performance, while RSF offers the best overall balance of discrimination, calibration, and robustness across heterogeneous clients\. We further find that performance depends on the diversity of client distributions, and that FedAvg and FedProx are stronger and more stable than FedAdam\. Based on these findings, we derive practical, decision\-oriented guidelines mapping data, privacy, interpretability, and resource constraints to recommended model and training\-paradigm choices for federated survival modeling in healthcare\.

###### Index Terms

Breast Cancer, Federated Learning, Healthcare Data, ML Privacy, Oncology, Survival Analysis

## 1 Introduction


The accelerating integration of artificial intelligence \(AI\) into clinical practice has turned data\-driven systems into a central tool for tasks such as diagnosis, prognosis, treatment planning, and disease\-progression forecasting\[[41](https://arxiv.org/html/2606.23871#bib.bib10),[31](https://arxiv.org/html/2606.23871#bib.bib11)\]\. These systems can support clinicians in making more accurate and timely decisions by learning patterns from large collections of patient records\[[21](https://arxiv.org/html/2606.23871#bib.bib12)\]\. However, their deployment in healthcare is fundamentally constrained by the decentralized and privacy\-sensitive nature of medical data, which is typically siloed across hospitals, laboratories, and research centers, each maintaining its own storage infrastructure, governance procedures, and ethical oversight\[[39](https://arxiv.org/html/2606.23871#bib.bib13)\]\. Even when technical interoperability is feasible, legal and ethical frameworks such as the General Data Protection Regulation \(GDPR\)\[[10](https://arxiv.org/html/2606.23871#bib.bib14)\] in Europe and the Health Insurance Portability and Accountability Act \(HIPAA\)\[[34](https://arxiv.org/html/2606.23871#bib.bib15)\] in the United States restrict the centralization of patient information\.

This situation creates a tension between data accessibility and data privacy\. On the one hand, robust machine learning \(ML\) models require large, diverse, and representative datasets to achieve generalizable predictions; on the other hand, the sensitivity of clinical data makes unrestricted aggregation neither acceptable nor viable\. As a consequence, many studies rely on single\-institution datasets, producing models that perform well in one clinical setting but fail to generalize across others, a phenomenon commonly referred to as dataset bias or domain shift\[[21](https://arxiv.org/html/2606.23871#bib.bib12)\]\. Beyond predictive performance, models trained on homogeneous data can also reflect the demographic or procedural biases of the institutions where the data were collected, raising important ethical and fairness concerns\[[5](https://arxiv.org/html/2606.23871#bib.bib16)\]\.

Federated learning \(FL\), introduced by McMahan et al\.\[[32](https://arxiv.org/html/2606.23871#bib.bib7)\], has emerged as a promising paradigm to address these challenges\. Instead of pooling data in a single location, FL enables multiple institutions to collaboratively train a shared model by exchanging only model parameters or updates, while keeping the underlying patient data local and private\. This approach preserves confidentiality, promotes cross\-institutional cooperation, and improves generalization by exposing the model to a broader range of data distributions\[[39](https://arxiv.org/html/2606.23871#bib.bib13)\]\. FL has already shown promise in healthcare applications such as multi\-institutional prediction of clinical outcomes in COVID\-19 patients\[[9](https://arxiv.org/html/2606.23871#bib.bib17)\] and privacy\-preserving brain tumour segmentation\[[28](https://arxiv.org/html/2606.23871#bib.bib18)\]\. Nevertheless, FL in healthcare remains an emerging field\[[19](https://arxiv.org/html/2606.23871#bib.bib19)\]: practical deployments must contend with communication overhead, statistical heterogeneity across clients \(non\-independent and identically distributed, or non\-iid, data\)\[[29](https://arxiv.org/html/2606.23871#bib.bib6)\], and the fact that sharing model updates does not provide absolute privacy, as gradient\-leakage attacks can partially reconstruct private inputs\[[12](https://arxiv.org/html/2606.23871#bib.bib20),[43](https://arxiv.org/html/2606.23871#bib.bib21),[11](https://arxiv.org/html/2606.23871#bib.bib1)\]\.

A clinically central yet comparatively underexplored application of FL is survival analysis, the branch of statistics concerned with modeling the time until an event of interest, such as disease progression, relapse, or death, occurs\[[7](https://arxiv.org/html/2606.23871#bib.bib22),[24](https://arxiv.org/html/2606.23871#bib.bib23)\]\. Unlike standard regression or classification, survival models must explicitly account for censoring, that is, individuals for whom the event has not been observed within the study period\. Survival analysis plays a decisive role in oncology, where breast cancer is the most frequently diagnosed cancer worldwide\[[40](https://arxiv.org/html/2606.23871#bib.bib24)\] and where reliable time\-to\-event estimates inform prognosis and treatment decisions\. Because such models benefit from large and diverse cohorts that are rarely available at a single institution, survival analysis is a natural candidate for federated settings; at the same time, censoring, unbalanced event rates, and inter\-institutional differences make the federated formulation methodologically challenging\.

The combination of FL and survival analysis is a recent but rapidly growing research direction\. FedSurF\+\+\[[2](https://arxiv.org/html/2606.23871#bib.bib25)\] extends Random Survival Forests \(RSF\) to the federated setting by aggregating locally trained survival trees in a single communication round\. Andreux et al\.\[[1](https://arxiv.org/html/2606.23871#bib.bib26)\] showed that naively federating the Cox Proportional Hazards \(CoxPH\) model yields a stratified Cox formulation that degrades under heterogeneity, and proposed a discrete\-time reformulation with a separable loss to enable effective federated training\. More recently, FedScore\-Surv\[[26](https://arxiv.org/html/2606.23871#bib.bib27)\] developed privacy\-preserving federated time\-to\-event scores across institutions, and FedPseudo\[[37](https://arxiv.org/html/2606.23871#bib.bib28)\] introduced a pseudo\-value\-based deep learning framework for federated survival modeling\. These contributions complement a broader body of FL\-in\-healthcare work spanning intensive\-care mortality prediction\[[33](https://arxiv.org/html/2606.23871#bib.bib29)\], multi\-modal COVID\-19 diagnosis\[[36](https://arxiv.org/html/2606.23871#bib.bib30)\], anomaly detection in the Internet of Medical Things \(IoMT\)\[[14](https://arxiv.org/html/2606.23871#bib.bib31)\], and cross\-federation knowledge distillation\[[6](https://arxiv.org/html/2606.23871#bib.bib32)\]\.

Despite this progress, most existing studies either focus on a single model family or propose a new algorithm evaluated in isolation\. Few works systematically compare different model families across training paradigms under controlled data heterogeneity, and ensemble methods such as RSF remain largely unexplored in federated settings beyond FedSurF\+\+\. Crucially, there is a lack of practical, decision\-oriented guidance on *which* survival model and *which* training paradigm to choose under given data characteristics, privacy constraints, and computational budgets\. This gap motivates a regime\-based, comparative evaluation rather than the pursuit of a single best\-performing model\.

In this work, we present a systematic, multi\-model evaluation of federated survival analysis on cross\-institutional, heterogeneous breast cancer data\. We compare three representative survival models, namely the statistical Cox Proportional Hazards \(CoxPH\) model, the deep\-learning\-based DeepSurv model, and the tree\-ensemble Random Survival Forest \(RSF\) model, across three training paradigms \(centralized, local, and federated\), using the Fed\-TCGA\-BRCA dataset from the FLamby benchmark suite\[[35](https://arxiv.org/html/2606.23871#bib.bib33)\]\. The code is publicly available at [https://github\.com/nataliamorenob/Survival\-Models\-in\-Federated\-Healthcare\-Settings](https://github.com/nataliamorenob/Survival-Models-in-Federated-Healthcare-Settings)\. The main contributions of this paper are as follows:

- A unified empirical comparison of CoxPH, DeepSurv, and RSF under local, federated, and centralized training on a real, multi\-institutional breast cancer cohort with naturally heterogeneous client distributions\.
- An analysis of how data heterogeneity, expressed through the number and composition of participating clients, affects both discrimination and calibration of federated survival models\.
- An evaluation of federated optimization strategies \(FedAvg, FedProx, and FedAdam\) for the two gradient\-based survival models, assessing their robustness under heterogeneous client distributions\.
- A set of practical, decision\-oriented guidelines that map data, privacy, interpretability, and resource constraints to recommended model and training\-paradigm choices\.

The remainder of this paper is organized as follows\. Section [2](https://arxiv.org/html/2606.23871#S2) describes the dataset, the survival models, the learning paradigms and federated optimization strategies, and the experimental setup\. Section [3](https://arxiv.org/html/2606.23871#S3) reports the evaluation metrics and experimental results, Section [4](https://arxiv.org/html/2606.23871#S4) discusses the main findings and derives practical guidelines, and the final section concludes the paper and outlines future research directions\.

## 2 Methods

This section describes the methodological framework adopted in this study\. First, the Fed\-TCGA\-BRCA dataset is presented, including a characterization of its cross\-institutional heterogeneity\. Then, the three survival models are introduced together with their suitability for federated time\-to\-event modeling\. Finally, the learning paradigms, the federated optimization strategies, and the experimental setup are detailed\.

### 2\.1 Dataset and Preprocessing

The experiments are conducted on the Fed\-TCGA\-BRCA dataset, obtained from the FLamby benchmark suite\[[35](https://arxiv.org/html/2606.23871#bib.bib33)\], which is specifically designed for cross\-silo federated learning in realistic healthcare settings\. The underlying data originate from The Cancer Genome Atlas Program \(TCGA\)\[[42](https://arxiv.org/html/2606.23871#bib.bib34)\], one of the largest publicly available cancer genomics resources, accessible through the Genomic Data Commons \(GDC\) data portal\[[17](https://arxiv.org/html/2606.23871#bib.bib35)\]\. The cohort focuses on breast invasive carcinoma \(BRCA\) and integrates the clinical, pathological, and molecular profiles of more than one thousand patients collected across multiple institutions worldwide\. For the FLamby benchmark, only the clinical tabular subset of TCGA\-BRCA is used, as it provides structured, interpretable variables suitable for tabular ML and allows modeling of the time from diagnosis or treatment initiation until death or last follow\-up\.

Each patient is represented by 39 features grouped into demographic \(e\.g\., age, ethnicity, and race indicators\), pathological \(e\.g\., tumor, node, metastasis, and overall stage variables\), diagnostic and coding \(ICD\-10 and morphology indicators\), clinical history and treatment \(prior malignancy and therapy indicators\), and simplified tumor\-stage variables\. The outcome is described by an event indicator $E$ \(i\.e\., 1 for an observed death, 0 for a right\-censored observation\) and a survival time $T$ measured in days\. The dataset is partitioned into training and test sets according to the original FLamby design\. The training set is further split into local training and validation sets within each client, as described in Section [2\.5](https://arxiv.org/html/2606.23871#S2.SS5)\.

Importantly, a defining property of this dataset is its natural cross\-institutional heterogeneity\. Table [1](https://arxiv.org/html/2606.23871#S2.T1) summarizes the cohort across the six original centers, reporting the number of patients, observed events \(deaths\), censored cases, the range of observed follow\-up times, and the event rate\. The dataset comprises 1,088 patients with 151 recorded events and 937 censored observations, yielding an overall event rate of approximately 13\.9%\. The centers differ markedly in size \(from 51 to 311 patients\), event rate \(from 5\.6% to 19\.9%\), and maximum observed follow\-up \(from 1,900 to 8,605 days\), and they also exhibit differing age distributions and survival profiles\. This combination of variable sample sizes, censoring levels, and follow\-up windows reproduces the non\-iid conditions encountered in real multi\-institutional studies and makes the dataset well suited for evaluating federated survival models under realistic heterogeneity\.

Table 1: Summary of the Fed\-TCGA\-BRCA Dataset by Center \(Combined Training and Test Data\)\. Note: Experiments use up to five clients, mapped to centers $\{0,\ldots,4\}$, as described in Section [2\.5](https://arxiv.org/html/2606.23871#S2.SS5)\. Client 5 was excluded due to limited events\.

Prior to training, the data of each client are split using a stratified, event\-aware strategy that preserves the event\-to\-censoring ratio and avoids data leakage\. The held\-out test partition is the one provided by FLamby; the remaining data are further divided into training and validation sets, with the validation proportion adapted to the local number of observed events to guarantee sufficient event representation:

$$
\text{val\_size}=\begin{cases}0.30&\text{if }n_{\text{events}}\geq 20\\ 0.25&\text{if }10\leq n_{\text{events}}<20\\ 0.20&\text{if }n_{\text{events}}<10\end{cases}\tag{1}
$$

with the additional constraint that each validation set contains at least two uncensored events\. Continuous features are standardized using per\-client $z$\-score normalization, with the mean and standard deviation computed exclusively on the training data and subsequently applied to the validation and test partitions to avoid leakage; the remaining binary features are left unscaled\.
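As an illustration, the event\-aware validation rule of Eq\. \(1\) and the per\-client standardization can be sketched in a few lines of Python\. This is a minimal sketch and not the authors’ implementation: the function names are hypothetical, and the use of the population standard deviation and the fallback to 1\.0 for constant features are assumptions that the text does not state\.

```python
def validation_fraction(n_events: int) -> float:
    """Validation proportion chosen from the local number of observed events (Eq. 1)."""
    if n_events >= 20:
        return 0.30
    if n_events >= 10:
        return 0.25
    return 0.20


def zscore_fit(train_values):
    """Mean and standard deviation estimated on the training split only."""
    n = len(train_values)
    mean = sum(train_values) / n
    var = sum((v - mean) ** 2 for v in train_values) / n  # population variance (assumption)
    std = var ** 0.5 or 1.0  # assumption: constant features are left unscaled
    return mean, std


def zscore_apply(values, mean, std):
    """Apply training-split statistics to any split (train, validation, or test)."""
    return [(v - mean) / std for v in values]


# Hypothetical client with 14 observed events and one continuous feature.
fraction = validation_fraction(14)            # falls in the 10 <= n_events < 20 branch
mean, std = zscore_fit([40.0, 55.0, 70.0])    # statistics from training data only
scaled_test = zscore_apply([55.0, 85.0], mean, std)
```

Fitting the statistics on the training split and reusing them on the other splits is what prevents validation or test information from leaking into the scaling\.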

### 2\.2 Survival Models

Survival analysis deals with time\-to\-event data in the presence of censoring\. For each individual $i$, let $T_i$ denote the \(possibly unobserved\) event time and $C_i$ the censoring time; the observed follow\-up time and event indicator are:

$$
Y_i=\min(T_i,C_i),\qquad\delta_i=\mathbb{I}(T_i\leq C_i)\tag{2}
$$

so that $\delta_i=1$ when the event is observed and $\delta_i=0$ when the observation is right\-censored\. Together with a covariate vector $\mathbf{x}_i\in\mathbb{R}^{p}$, the observed data form a collection of triplets $(\mathbf{x}_i,Y_i,\delta_i)$\. The two quantities of interest are the survival function $S(t)=\mathbb{P}(T>t)$ and the hazard function:

$$
h(t)=\lim_{\Delta t\to 0}\dfrac{\mathbb{P}(t\leq T<t+\Delta t\mid T\geq t)}{\Delta t}\tag{3}
$$

which represents the instantaneous risk of experiencing the event at time $t$ given survival up to that time\. Survival models aim to estimate these functions as a function of the covariates $\mathbf{x}$, enabling predictions of individual survival probabilities and risk scores\.
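For reference, the survival and hazard functions are linked by a standard identity of survival analysis, recalled here as a reminder rather than taken from the paper\. Writing $H(t)$ for the cumulative hazard \(a symbol introduced only for this note\) and assuming $T$ is continuous:

$$
H(t)=\int_{0}^{t}h(u)\,du,\qquad S(t)=\exp\!\bigl(-H(t)\bigr),\qquad h(t)=-\frac{d}{dt}\log S(t).
$$

An estimate of either function therefore determines the other\. As a one\-line example, a constant hazard $h(t)=\lambda$ gives $H(t)=\lambda t$ and $S(t)=e^{-\lambda t}$\.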

To compare complementary modeling philosophies, three survival models with native support for censored data are evaluated: the statistical CoxPH model, the deep\-learning\-based DeepSurv model, and the tree\-ensemble RSF model\. Each model is trained and evaluated under the centralized, local, and federated paradigms described in Section [2\.3](https://arxiv.org/html/2606.23871#S2.SS3)\.

#### 2\.2\.1 Cox Proportional Hazards \(CoxPH\)

The CoxPH model\[[8](https://arxiv.org/html/2606.23871#bib.bib36)\] expresses the hazard of an individual as the product of an unspecified baseline hazard $h_0(t)$ and an exponential function of the covariates:

$$
h(t\mid\mathbf{x}_i)=h_0(t)\,\exp\!\left(\boldsymbol{\beta}^{\top}\mathbf{x}_i\right)\tag{4}
$$

under the proportional\-hazards assumption that the hazard ratio between any two individuals is constant over time\. Being semi\-parametric, the model estimates the coefficients $\boldsymbol{\beta}$ through partial\-likelihood maximization without specifying $h_0(t)$\. Its interpretability, simplicity, and long history make CoxPH a standard reference in clinical survival analysis\[[24](https://arxiv.org/html/2606.23871#bib.bib23)\]\. A known limitation in distributed settings, however, is its reliance on global risk sets: the partial likelihood couples individuals across the whole cohort, information that is not locally available when data cannot be shared\[[1](https://arxiv.org/html/2606.23871#bib.bib26)\]\. In this work, CoxPH serves as a classical statistical baseline against which more flexible models are compared\.
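The coupling through risk sets can be made concrete with a small sketch of the negative Cox partial log\-likelihood\. The code is illustrative only: it uses the Breslow form without tie correction, which is an assumption \(the text does not say how ties are handled\), and the function name is hypothetical\. The inner sum for individual $i$ runs over every individual still at risk at $Y_i$, regardless of which institution holds the record\.

```python
import math


def neg_cox_partial_loglik(beta, X, times, events):
    """Negative Cox partial log-likelihood (Breslow form, no tie correction).

    beta   : list of p coefficients
    X      : list of n covariate rows, each of length p
    times  : observed follow-up times Y_i
    events : event indicators delta_i (1 = event, 0 = right-censored)
    """
    # Linear predictor eta_i = beta^T x_i for every individual.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored individuals contribute only through risk sets
        # Risk set: everyone whose observed time is at least Y_i.
        risk = [eta[j] for j, t_j in enumerate(times) if t_j >= t_i]
        # Numerically stable log-sum-exp over the risk set.
        top = max(risk)
        log_sum = top + math.log(sum(math.exp(r - top) for r in risk))
        total += eta[i] - log_sum
    return -total


# Toy cohort: with beta = 0 the loss reduces to the sum of log risk-set sizes.
loss = neg_cox_partial_loglik([0.0], [[1.0], [2.0], [3.0]], [1, 2, 3], [1, 1, 0])
```

In the toy cohort the two events have risk sets of size 3 and 2, so the loss equals $\log 3+\log 2=\log 6$\. Splitting the same three patients across two clients would change the risk sets, which is the difficulty described above\.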

#### 2\.2\.2 DeepSurv

DeepSurv\[[20](https://arxiv.org/html/2606.23871#bib.bib5)\] is a deep\-learning extension of CoxPH in which the linear risk term $\boldsymbol{\beta}^{\top}\mathbf{x}$ is replaced by a neural network $g(\mathbf{x};\boldsymbol{\theta})$, allowing nonlinear covariate effects to be modeled\. The network consists of an input layer, hidden layers with nonlinear activations, and a single output that predicts a log\-risk score, and it is trained by minimizing the negative Cox partial log\-likelihood using gradient\-based optimization\. DeepSurv is a widely adopted neural survival baseline, and several related deep models, such as DeepHit\[[25](https://arxiv.org/html/2606.23871#bib.bib37)\] or VAECox\[[22](https://arxiv.org/html/2606.23871#bib.bib38)\], relax the proportional\-hazards assumption further\. Because DeepSurv inherits the same partial\-likelihood loss as CoxPH, the Cox loss must be approximated locally in the federated setting, since the exact global risk sets cannot be reconstructed without exchanging survival\-time information between clients\.

#### 2\.2\.3 Random Survival Forests \(RSF\)

RSF\[[18](https://arxiv.org/html/2606.23871#bib.bib40)\] extends Random Forests\[[4](https://arxiv.org/html/2606.23871#bib.bib41)\] to survival data by growing an ensemble of survival trees, each trained on a bootstrap sample and split using survival\-specific criteria, and aggregating their cumulative\-hazard estimates\. RSF captures nonlinear relationships and covariate interactions without parametric assumptions, and has demonstrated strong predictive performance on heterogeneous clinical datasets\[[30](https://arxiv.org/html/2606.23871#bib.bib42)\]\. In the federated setting, RSF follows the FedSurF\+\+ approach\[[2](https://arxiv.org/html/2606.23871#bib.bib25)\], which operates on fully trained trees rather than gradients\. In a local\-training step, each client $k$ fits a forest on its local data and evaluates the quality of its trees on a local validation set using the concordance index, while sending only metadata \(the number of trees $T_k$ and samples $N_k$\) to the server\. The server then performs a tree\-assignment step, distributing the $T$ global slots among clients with probability proportional to $N_k$; finally, in a tree\-sampling step, each client samples its allocated trees without replacement, with selection probability proportional to validation performance, and sends them to the server, which aggregates them into a global federated forest\. Because aggregation operates directly on trees, the federated procedure requires only a single communication round, substantially reducing communication overhead relative to iterative gradient\-based methods\.
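The tree\-assignment and tree\-sampling steps described above can be sketched as follows\. This is a simplified illustration of the procedure as summarized in this section, not the FedSurF\+\+ reference implementation: the function names are hypothetical, trees are represented only by their indices and validation C\-index values, all sizes and scores are made up, and capping a client’s allocation at its number of available trees is an assumption\.

```python
import random


def assign_tree_slots(n_samples, total_trees, rng):
    """Server side: give each of the T global slots to a client,
    drawing clients with probability proportional to N_k."""
    clients = list(range(len(n_samples)))
    counts = [0] * len(n_samples)
    for k in rng.choices(clients, weights=n_samples, k=total_trees):
        counts[k] += 1
    return counts


def sample_trees(val_scores, n_select, rng):
    """Client side: pick tree indices without replacement, with probability
    proportional to each tree's validation score (assumed strictly positive)."""
    remaining = list(range(len(val_scores)))
    chosen = []
    # Assumption: a client cannot send more trees than it has trained.
    for _ in range(min(n_select, len(remaining))):
        weights = [val_scores[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


rng = random.Random(0)
# Hypothetical metadata: three clients with 300, 200, and 100 samples.
slots = assign_tree_slots([300, 200, 100], total_trees=100, rng=rng)
# Hypothetical validation C-index of the four trees held by client 0.
picked = sample_trees([0.60, 0.70, 0.80, 0.65], n_select=2, rng=rng)
```

Only the selected trees and the two metadata counts leave a client, which is why a single exchange is enough to build the global forest\.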

### 2\.3 Learning Paradigms

To assess the effect of data decentralization on survival modeling, three learning paradigms that differ in how data are accessed during training are considered\.

##### Centralized learning

All client data are pooled into a single set before training, and the model is trained as if originating from one source\. After training, the model is evaluated on each client’s local test set\. This paradigm assumes that data sharing is possible and serves as a best\-case reference\.

##### Local learning

Each client trains an independent model on its own data, with no sharing of data, parameters, or updates\. This fully privacy\-preserving but non\-collaborative setting provides a lower\-reference baseline whose performance is expected to suffer under data scarcity and heterogeneity\.

##### Federated learning

The training data remain at each institution, and a shared global model is obtained through repeated communication between the clients and a central server\. Assuming $K$ clients, where client $k$ holds a local dataset $\mathcal{D}_k$ of size $n_k$ and $N=\sum_{k=1}^{K}n_k$, the federated objective minimizes the data\-weighted average of the local empirical risks:

$$
f(w)=\sum_{k=1}^{K}\frac{n_k}{N}\,F_k(w)\tag{5}
$$

$$
F_k(w)=\frac{1}{n_k}\sum_{(x_i,y_i)\in\mathcal{D}_k}\ell(w;x_i,y_i)\tag{6}
$$

where $w$ denotes the model parameters and $\ell$ the per\-sample loss\. At each round, the server broadcasts the current global model, each selected client performs local optimization steps, and the resulting updates are aggregated into a new global model; the process repeats until convergence\. The final global model is evaluated on each client’s local test set\.
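Substituting Eq\. \(6\) into Eq\. \(5\) makes the weighting explicit\. The short derivation below uses only the definitions above:

$$
f(w)=\sum_{k=1}^{K}\frac{n_k}{N}\cdot\frac{1}{n_k}\sum_{(x_i,y_i)\in\mathcal{D}_k}\ell(w;x_i,y_i)=\frac{1}{N}\sum_{k=1}^{K}\sum_{(x_i,y_i)\in\mathcal{D}_k}\ell(w;x_i,y_i).
$$

For a loss that decomposes over samples, the federated objective therefore coincides with the empirical risk on the pooled data\. The Cox partial likelihood is not of this form, because each term depends on a risk set; this is consistent with the remark in Section 2\.2 that the Cox loss has to be approximated locally\.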

### 2\.4 Federated Optimization Strategies

For the two gradient\-based models \(CoxPH and DeepSurv\), three federated aggregation and optimization strategies are evaluated\. The RSF model relies on tree aggregation rather than gradient\-based optimization and is therefore incompatible with these strategies\.

##### FedAvg

Federated Averaging\[[32](https://arxiv.org/html/2606.23871#bib.bib7)\] is the standard FL baseline\. After each selected client has performed a few local stochastic\-gradient\-descent steps, the server computes the data\-weighted average of the received parameters:

$$
w^{(t+1)}=\sum_{k\in S_t}\frac{n_k}{N}\,w_k^{(t+1)}\tag{7}
$$

where $S_t$ is the set of clients selected at round $t$\. FedAvg is communication\-efficient but assumes that local updates do not diverge excessively, which may not hold under strong non\-iid conditions\.
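A minimal sketch of the aggregation in Eq\. \(7\) follows\. It assumes that each client’s parameters are flattened into a list of floats and that $N$ is the total sample count of the clients taking part in the round \(the text does not specify whether $N$ covers all clients or only the selected ones\)\. The function name is hypothetical\.

```python
def fedavg(client_params, client_sizes):
    """Data-weighted average of client parameter vectors (Eq. 7).

    client_params : list of equally long parameter lists, one per selected client
    client_sizes  : local sample counts n_k, in the same order
    """
    total = sum(client_sizes)  # assumption: N sums over the participating clients
    dim = len(client_params[0])
    return [
        sum(n * params[d] for params, n in zip(client_params, client_sizes)) / total
        for d in range(dim)
    ]


# Two hypothetical clients holding 1 and 3 samples.
global_params = fedavg([[1.0, 0.0], [3.0, 4.0]], [1, 3])  # -> [2.5, 3.0]
```

The larger client dominates the average, which is the intended behavior under Eq\. \(7\) but also the reason small, atypical centers have limited influence on the global model\.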

##### FedProx

FedProx\[[27](https://arxiv.org/html/2606.23871#bib.bib8)\] extends FedAvg by adding a proximal term to each local objective:

$$
\min_{w}\;F_k(w)+\frac{\mu}{2}\,\lVert w-w^{(t)}\rVert^{2}\tag{8}
$$

where $w^{(t)}$ is the current global model and $\mu$ controls the penalty on local drift\. The proximal term limits divergence between local and global models, which is designed to improve robustness under heterogeneous client distributions and variable local computation\.
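The effect of the proximal term is easiest to see in a single local gradient step: its gradient, $\mu\,(w-w^{(t)})$, pulls the local iterate back toward the global model\. The sketch below assumes plain gradient descent with a learning rate `lr` and a precomputed gradient of $F_k$; both, like the function name, are illustrative assumptions\.

```python
def fedprox_step(w, w_global, grad_fk, mu, lr):
    """One local gradient step on F_k(w) + (mu / 2) * ||w - w_global||^2.

    w        : current local parameters
    w_global : global model w^(t) received at the start of the round
    grad_fk  : gradient of the local loss F_k evaluated at w
    mu       : proximal coefficient; mu = 0 recovers a plain gradient step
    lr       : local learning rate (assumed, not taken from the paper)
    """
    return [
        wi - lr * (gi + mu * (wi - w0))
        for wi, w0, gi in zip(w, w_global, grad_fk)
    ]


# Hypothetical one-dimensional example.
with_prox = fedprox_step([1.0], [0.0], [0.5], mu=0.1, lr=1.0)
without_prox = fedprox_step([1.0], [0.0], [0.5], mu=0.0, lr=1.0)
```

With the proximal term the step ends closer to the global model than without it, and the gap grows with $\mu$ and with the distance already travelled from $w^{(t)}$\.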

##### FedAdam

FedAdam\[[38](https://arxiv.org/html/2606.23871#bib.bib9)\] applies adaptive moment estimation on the server side\. Rather than averaging updates directly, the server maintains first\- and second\-moment estimates of the aggregated update $\Delta_t$:

$$
m_t=\beta_1 m_{t-1}+(1-\beta_1)\,\Delta_t\tag{9}
$$

$$
v_t=\beta_2 v_{t-1}+(1-\beta_2)\,\Delta_t^{2}\tag{10}
$$

where $\beta_1$ and $\beta_2$ are exponential decay rates, and updates the global model using an Adam\-style rule\[[23](https://arxiv.org/html/2606.23871#bib.bib43)\]\. This adaptive server optimization is intended to improve convergence stability for non\-convex objectives\.
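The text gives the moment estimates but not the final update\. The sketch below assumes the commonly used Adam\-style server step $w\leftarrow w+\eta\,m_t/(\sqrt{v_t}+\tau)$, where the server learning rate $\eta$, the stabilizer $\tau$, and all default values are assumptions rather than settings reported in the paper\. The function name is hypothetical\.

```python
def fedadam_update(w, delta, m, v, lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """One server-side FedAdam round on flattened parameter lists.

    w     : current global parameters
    delta : aggregated client update Delta_t
    m, v  : first- and second-moment estimates from the previous round
    """
    # Eq. (9): exponential moving average of the aggregated update.
    m = [beta1 * mi + (1 - beta1) * di for mi, di in zip(m, delta)]
    # Eq. (10): exponential moving average of its element-wise square.
    v = [beta2 * vi + (1 - beta2) * di * di for vi, di in zip(v, delta)]
    # Assumed Adam-style step; tau avoids division by zero.
    w = [wi + lr * mi / (vi ** 0.5 + tau) for wi, mi, vi in zip(w, m, v)]
    return w, m, v


# First round from zero-initialized moments with a unit update.
w1, m1, v1 = fedadam_update([0.0], [1.0], [0.0], [0.0])
```

Because the step is scaled by $\sqrt{v_t}$, early rounds with small moment estimates can produce relatively large moves, which is one plausible source of the sensitivity to hyperparameters that adaptive server optimizers are known for\.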

Federated orchestration and communication are implemented using the Flower framework\[[3](https://arxiv.org/html/2606.23871#bib.bib44)\], and the number of communication rounds is derived from the data\-driven formulation provided by FLamby \(Section [2\.5](https://arxiv.org/html/2606.23871#S2.SS5)\)\.

### 2\.5 Experimental Setup

To study the effect of data heterogeneity, experiments are performed with three client configurations of five, four, and three clients, mapped consistently to centers $\{0,1,2,3,4\}$, $\{0,1,2,3\}$, and $\{0,1,2\}$, respectively\. Reducing the number of clients excludes specific data distributions rather than redistributing data, so each remaining client retains its original distribution\. To account for stochasticity in training, each experiment is repeated ten times with different random seeds, and all reported metrics are given as the mean and standard deviation over these runs\.

For a fair comparison between centralized and federated training, the maximum number of federated rounds is not chosen arbitrarily but computed so that the total number of local update steps matches the centralized training budget\. Following FLamby\[[35](https://arxiv.org/html/2606.23871#bib.bib33)\], given $n^{P}_{\text{epochs}}$ centralized epochs, $n_T$ total training samples, $K$ clients, batch size $B$, and $E$ local update steps:

$$
T_{\max}=n^{P}_{\text{epochs}}\left\lfloor\frac{n_T}{K\cdot B\cdot E}\right\rfloor\tag{11}
$$

For RSF, a separate design is used because of its tree\-based nature: a first experiment selects the number of local trees \(varied among 20, 50, 100, and 200\) while keeping the number of aggregated global trees fixed, and a second experiment evaluates the chosen configuration across paradigms and client counts\. After this analysis, the number of trees is fixed at 100, which balances performance and computational cost\.
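Eq\. \(11\) translates directly into integer arithmetic\. In the sketch below the function name and the example values are hypothetical and are not the settings used in the experiments\.

```python
def max_federated_rounds(n_epochs_pooled, n_train, n_clients, batch_size, local_steps):
    """Round budget of Eq. (11): centralized epochs times the floor of
    n_T / (K * B * E), computed with integer division."""
    return n_epochs_pooled * (n_train // (n_clients * batch_size * local_steps))


# Hypothetical setting: 10 centralized epochs, 1,000 training samples,
# 5 clients, batch size 8, and 5 local update steps per round.
rounds = max_federated_rounds(10, 1000, 5, 8, 5)  # 10 * (1000 // 200) -> 50
```

Since the floor is applied before the multiplication, the budget drops to zero whenever $K\cdot B\cdot E$ exceeds $n_T$, so the number of local steps has to be chosen with the cohort size in mind\.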

Importantly, the held\-out test partition is never used during training or validation\. In federated experiments, evaluation is performed after each communication round, and the final results are reported at the last round, averaged across runs\. For time\-dependent metrics, a client\-specific grid of 100 equally spaced time points is defined over the common follow\-up interval shared by the client’s training and test sets\. Let $\{t^{\mathrm{tr}}_{(1)}\leq\cdots\leq t^{\mathrm{tr}}_{(n)}\}$ and $\{t^{\mathrm{te}}_{(1)}\leq\cdots\leq t^{\mathrm{te}}_{(m)}\}$ denote the ordered observed times in the training and test sets; the grid bounds are:

$$
t_{\min}=\max\!\left(t^{\mathrm{tr}}_{(2)},t^{\mathrm{te}}_{(2)}\right),\;t_{\max}=\min\!\left(t^{\mathrm{tr}}_{(n-1)},t^{\mathrm{te}}_{(m-1)}\right)\tag{12}
$$

and the evaluation grid is $\mathcal{T}=\{\,t_{\min}+\tfrac{k}{99}(t_{\max}-t_{\min})\}_{k=0}^{99}$\. Restricting evaluation to this common interior interval avoids boundary regions where the risk set is small and estimates based on censored times become unstable\.
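The grid construction of Eq\. \(12\) can be written as a short function\. This is a sketch under the definitions above; the function name, the input check, and the toy follow\-up times are illustrative assumptions\.

```python
def evaluation_grid(train_times, test_times, n_points=100):
    """Equally spaced evaluation times between the second-smallest and
    second-largest observed times shared by the two splits (Eq. 12)."""
    if len(train_times) < 3 or len(test_times) < 3 or n_points < 2:
        raise ValueError("need at least three observed times per split and two grid points")
    tr, te = sorted(train_times), sorted(test_times)
    t_min = max(tr[1], te[1])     # larger of the two second-smallest times
    t_max = min(tr[-2], te[-2])   # smaller of the two second-largest times
    return [
        t_min + k / (n_points - 1) * (t_max - t_min)
        for k in range(n_points)
    ]


# Toy follow-up times in days for one hypothetical client.
grid = evaluation_grid([10, 20, 30, 40, 50], [5, 15, 25, 35, 45])
```

For the toy data the bounds are $t_{\min}=\max(20,15)=20$ and $t_{\max}=\min(40,35)=35$, so the grid spans 20 to 35 days in 100 steps\. The sketch does not handle the degenerate case $t_{\max}\leq t_{\min}$, which could occur for very small clients\.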

Finally, all three models are evaluated under the centralized, local, and federated paradigms for the five\-, four\-, and three\-client configurations\. For the gradient\-based models, the federated setting is additionally evaluated with FedAvg, FedProx, and FedAdam; for clarity, the main results report FedAvg, which is consistently strong and stable, while the comparison across strategies is analyzed separately\. The performance of all models is assessed using the discrimination and calibration metrics described next\.

## 3Results

This section presents the experimental results for the evaluated survival analysis models and FL configurations. First, the evaluation metrics used in the study are described. Next, the performance of the CoxPH, DeepSurv, and RSF models is analyzed individually across the training paradigms to investigate the impact of distributed learning on survival prediction, followed by a cross-model comparison. Finally, the federated optimization strategies (FedAvg, FedAdam, and FedProx) are compared to assess the robustness of federated optimization under heterogeneous client distributions.

### 3\.1Evaluation Metrics

This section describes the metrics used to assess the survival analysis models across training paradigms and client configurations. The selected metrics include both discrimination and calibration measures. Discrimination metrics, such as the Concordance Index (C-Index) and the time-dependent Area Under the Curve (AUC), evaluate the model's ability to correctly rank individuals according to their risk of experiencing the event. Calibration metrics, such as the Integrated Brier Score (IBS), assess the accuracy of the predicted survival probabilities over time. Because each metric captures a different aspect of predictive behavior, combining them gives a more complete and reliable picture of the strengths and weaknesses of each model.

#### 3\.1\.1Concordance Index \(C\-Index\)

The C-Index, introduced by Harrell et al. [[15](https://arxiv.org/html/2606.23871#bib.bib2)], measures the ability of a survival model to rank individuals correctly according to their risk of experiencing the event while accounting for censoring. It is calculated as the proportion of all comparable pairs of individuals for which the survival model assigns a higher risk to the individual who experiences the event first. A C-Index of 0.5 corresponds to random ranking, while a C-Index of 1.0 corresponds to perfect discrimination. The C-Index is defined as:

$$\text{C-Index} = \frac{\sum_{i}\sum_{j} \mathbb{I}(r_{i} > r_{j})\,\mathbb{I}(T_{i} < T_{j})\,\mathbb{I}(\delta_{i} = 1)}{\sum_{i}\sum_{j} \mathbb{I}(T_{i} < T_{j})\,\mathbb{I}(\delta_{i} = 1)} \tag{13}$$

where $r_{i}$ and $r_{j}$ represent the predicted risk scores for individuals $i$ and $j$ respectively, $T_{i}$ and $T_{j}$ are the observed event or censoring times, $\delta_{i}$ is the event indicator (1 if the event occurred, 0 if censored), and $\mathbb{I}(\cdot)$ is the indicator function that equals 1 if the condition is true and 0 otherwise.
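Eq. (13) translates directly into a double loop over pairs. The sketch below is a plain-Python illustration with hypothetical toy data; it follows Eq. (13) literally, so tied risk scores contribute nothing to the numerator, whereas some library implementations give ties half credit.

```python
def c_index(risk, time, event):
    """Concordance index following Eq. (13)."""
    concordant = 0
    comparable = 0
    n = len(time)
    for i in range(n):
        if event[i] != 1:
            continue  # individual i must have an observed event
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: the third individual is censored.
time = [2, 4, 6, 8]
event = [1, 1, 0, 1]
risk = [0.9, 0.3, 0.5, 0.1]
print(c_index(risk, time, event))
```

In this toy cohort there are five comparable pairs, and the only discordant one is the second individual (event at time 4, risk 0.3) against the censored third individual (risk 0.5).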

#### 3\.1\.2Time\-Dependent Area Under the Curve \(AUC\)

The time-dependent AUC, based on the cumulative/dynamic ROC formulation proposed by Heagerty et al. [[16](https://arxiv.org/html/2606.23871#bib.bib3)], evaluates the model's capacity to distinguish between individuals who have experienced the event by a given time and those who have not, accounting for censoring. In this work, the AUC is reported as the time-integrated cumulative/dynamic AUC computed over the evaluation time grid. The time-dependent AUC at time $t$ is defined as:

$$\text{AUC}(t) = \mathbb{P}\left(r_{i}(t) > r_{j}(t) \;\middle|\; T_{i} \leq t,\; \delta_{i} = 1,\; T_{j} > t\right) \tag{14}$$

where $i$ and $j$ index two distinct individuals, $r_{i}(t)$ represents the time-dependent predicted risk score for individual $i$ at time $t$, and $T_{i}$ is the observed event or censoring time for individual $i$. The condition $T_{i} \leq t$ together with $\delta_{i} = 1$ means that individual $i$ experienced the event at or before time $t$, and the condition $T_{j} > t$ means that individual $j$ is still event-free at time $t$. The time-integrated AUC is then calculated by integrating the time-dependent AUC over the specified time interval:

$$\text{AUC}_{\text{int}} = \frac{1}{t_{\max} - t_{\min}} \int_{t_{\min}}^{t_{\max}} \text{AUC}(t)\,dt \tag{15}$$

where $t_{\min}$ and $t_{\max}$ specify the window of evaluation. The integrated AUC is a summary of the average discriminative ability of the model over the interval from $t_{\min}$ to $t_{\max}$.
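Eqs. (14) and (15) can be sketched as an empirical pairwise estimate followed by trapezoidal integration. The sketch makes two simplifying assumptions that are not stated in the paper: the risk score is constant in time, and no inverse-probability-of-censoring weights are applied. The toy data and function names are hypothetical.

```python
def auc_at(t, risk, time, event):
    """Empirical version of Eq. (14) with a time-constant risk score."""
    cases = [r for r, T, d in zip(risk, time, event) if T <= t and d == 1]
    controls = [r for r, T in zip(risk, time) if T > t]
    if not cases or not controls:
        raise ValueError("AUC(t) is undefined without cases and controls")
    wins = sum(1 for rc in cases for rk in controls if rc > rk)
    return wins / (len(cases) * len(controls))


def integrated_auc(grid, risk, time, event):
    """Eq. (15) approximated with the trapezoidal rule on the grid."""
    values = [auc_at(t, risk, time, event) for t in grid]
    area = sum(0.5 * (values[k] + values[k + 1]) * (grid[k + 1] - grid[k])
               for k in range(len(grid) - 1))
    return area / (grid[-1] - grid[0])


# Hypothetical toy cohort: the third individual is censored at time 6.
time = [2, 4, 6, 8]
event = [1, 1, 0, 1]
risk = [0.9, 0.3, 0.5, 0.1]
print(integrated_auc([3, 5, 7], risk, time, event))
```

Individuals censored before $t$ are neither cases nor controls in this sketch, which is one reason weighted estimators are normally preferred when censoring is heavy.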

#### 3\.1\.3Integrated Brier Score \(IBS\)

The IBS, defined by Graf et al. [[13](https://arxiv.org/html/2606.23871#bib.bib4)], assesses the accuracy of the predicted survival probabilities over time. It computes the squared error between the predicted survival probability and the observed outcome and integrates this error over the evaluation time interval. A lower IBS indicates a more accurate and better-calibrated survival model. The Brier score at time $t$ is given by:

$$\text{BS}(t) = \frac{1}{N} \sum_{i=1}^{N} \left(\mathbb{I}(T_{i} > t) - \hat{S}_{i}(t)\right)^{2} \tag{16}$$

where $N$ is the total number of individuals, $T_{i}$ is the observed event or censoring time for individual $i$, $\hat{S}_{i}(t)$ is the predicted survival probability for individual $i$ at time $t$, and $\mathbb{I}(T_{i} > t)$ is the indicator function, equal to 1 if individual $i$ is event-free at time $t$ and 0 otherwise. The Integrated Brier Score (IBS) is then given by:

$$\text{IBS} = \frac{1}{t_{\max} - t_{\min}} \int_{t_{\min}}^{t_{\max}} \text{BS}(t)\,dt \tag{17}$$

where $t_{\min}$ and $t_{\max}$ define the evaluation time interval. The IBS represents the average squared prediction error over $[t_{\min}, t_{\max}]$, providing a scalar summary of overall predictive accuracy. Lower values indicate better predictive performance and calibration.
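A minimal sketch of Eqs. (16) and (17) follows. It implements the unweighted form written above, in which every individual contributes through the observed time $T_{i}$; implementations that follow Graf et al. more closely typically add inverse-probability-of-censoring weights, which are omitted here. The function names and toy predictions are hypothetical.

```python
def brier_score(t, time, surv_at_t):
    """Eq. (16): mean squared error between I(T_i > t) and S_i(t)."""
    n = len(time)
    return sum(((1.0 if T > t else 0.0) - s) ** 2
               for T, s in zip(time, surv_at_t)) / n


def integrated_brier_score(grid, time, surv):
    """Eq. (17) with the trapezoidal rule; surv[i][k] is S_i(grid[k])."""
    scores = [brier_score(t, time, [row[k] for row in surv])
              for k, t in enumerate(grid)]
    area = sum(0.5 * (scores[k] + scores[k + 1]) * (grid[k + 1] - grid[k])
               for k in range(len(grid) - 1))
    return area / (grid[-1] - grid[0])


# Hypothetical predictions for two individuals on a three-point grid.
time = [2, 6]
grid = [1, 3, 5]
surv = [[0.9, 0.2, 0.1],
        [1.0, 0.8, 0.6]]
print(integrated_brier_score(grid, time, surv))
```

A model that predicted $\hat{S}_{i}(t) = 0.5$ everywhere would score 0.25 at every time point, which is a common reference level when reading IBS values.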

### 3\.2Model Performance Under Different Training Paradigms

This section presents a detailed analysis of the performance of the CoxPH, DeepSurv, and RSF models across the local, federated, and centralized training paradigms for different client configurations. The analysis aims to evaluate the effect of distributed learning settings on survival prediction performance, while also examining client-level variability under heterogeneous data distributions. Finally, a client-level cross-model comparison is conducted under the federated learning setting to analyze the relative predictive performance of the evaluated survival models across different clients and evaluation metrics. The complete per-client results underlying these analyses are reported in Appendix [6](https://arxiv.org/html/2606.23871#S6).

#### 3\.2\.1CoxPH

The average performance of the CoxPH models using centralized, federated, and local training methods across different numbers of clients is presented in Table[2](https://arxiv.org/html/2606.23871#S3.T2)\. The best discrimination performance can be obtained when using the centralized training method in terms of both C\-Index and AUC scores\. When five clients are present, the performance of centralized training is measured at 0\.732 on the C\-Index, in contrast to federated and local training performances of 0\.611 and 0\.598, respectively\. A similar trend in terms of AUC was observed, where centralized training obtained a value of 0\.697 compared to federated \(0\.600\) and local \(0\.606\) training\.

Regarding calibration performance, federated training achieved the lowest IBS values across all client configurations\. For instance, with five clients, federated training obtained an IBS of 0\.155, compared to 0\.167 for local training and 0\.290 for centralized training\. As the number of clients decreased from five to three, there were minor changes in results across all training paradigms\. In general, it was observed that C\-Index and AUC values remained relatively stable, whereas IBS values increased for local and federated training\.

Table 2: Mean CoxPH Performance Across Training Paradigms and Number of Clients

Note: ∗ Values do not exhibit statistically significant differences with Centralized according to Dunn's post-hoc test ($p > 0.05$).

The client-level variation in CoxPH performance under the five-client configuration is illustrated in Fig. [1](https://arxiv.org/html/2606.23871#S3.F1). This setup was selected because it represents the scenario with the highest number of participating clients. Performance varies across clients for all metrics and all training paradigms. Clients C0 and C4 achieved the highest C-Index and AUC values, respectively, for most of the training paradigms, whereas client C1 achieved lower performance. Variability across runs, represented by the error bars, was higher for certain clients, particularly under local and federated training. Lastly, for several clients, federated training tended to outperform local training on discrimination metrics, though this advantage was not consistent across all clients and metrics.

![Refer to caption](https://arxiv.org/html/2606.23871v1/x1.png)

Figure 1: Client-level performance variability across training paradigms for CoxPH under the five-client configuration. Results are reported as mean ± standard deviation across 10 independent runs for each client: (a) C-Index, (b) AUC, and (c) IBS.
#### 3\.2\.2DeepSurv

Table[3](https://arxiv.org/html/2606.23871#S3.T3)summarizes the performance of the DeepSurv model under local, federated, and centralized training paradigms across different client configurations\. Federated training achieved the highest discrimination performance for most client configurations, obtaining the highest C\-Index and AUC values\. When considering the five\-client configuration, federated training reached a C\-Index of 0\.727, compared to 0\.707 for centralized training and 0\.618 for local training\. A similar trend was observed for AUC, where federated training achieved a value of 0\.702, outperforming centralized and local learning, whose scores were 0\.684 and 0\.620, respectively\.

In terms of IBS, federated training also obtained the lowest values across all client configurations\. For instance, with five clients, federated training achieved an IBS of 0\.148, compared to 0\.159 for local training and 0\.269 for centralized training\. As the number of clients decreased from five to three, minor differences in performance were observed across all paradigms\. Overall, federated learning offered strong discrimination performance while consistently achieving the lowest IBS values across all client configurations\.

Table 3: Mean DeepSurv Performance Across Training Paradigms and Number of Clients

Note: ∗ Values do not exhibit statistically significant differences with Centralized according to Dunn's post-hoc test ($p > 0.05$).

Fig\.[2](https://arxiv.org/html/2606.23871#S3.F2)illustrates the client\-level results obtained by the DeepSurv model for the five\-client configuration\. Differences in performance among clients were observed for all evaluated metrics and training paradigms\. Client C4 obtained the highest C\-Index and AUC values, whereas lower discrimination performance was observed for client C3\. On the other hand, for the IBS metric, higher values were observed for client C1 under local and federated training\. Variability across runs, represented by the error bars, was more pronounced for clients C1 and C4 across the discrimination metrics\. Lastly, federated training achieved higher discrimination performance than local training for several clients, particularly for clients C1, C2, and C4 in terms of C\-Index and AUC\.

![Refer to caption](https://arxiv.org/html/2606.23871v1/x2.png)

Figure 2: Client-level performance variability across training paradigms for DeepSurv under the five-client configuration. Results are reported as mean ± standard deviation across 10 independent runs for each client: (a) C-Index, (b) AUC, and (c) IBS.
#### 3\.2\.3Random Survival Forest \(RSF\)

The performance obtained by the RSF model under local, federated, and centralized training paradigms is reported in Table [4](https://arxiv.org/html/2606.23871#S3.T4). For consistency, the number of trees was fixed at 100 across all RSF configurations after preliminary experiments on different numbers of trees (20, 50, 100, and 200) showed that this setting provided a good balance between performance and computational efficiency. Centralized training achieved the highest C-Index values across all client configurations, reaching 0.746 for the five-client configuration, compared to 0.724 for federated training and 0.687 for local training. A similar trend was observed for AUC, where centralized training obtained the highest values for all configurations.

For the IBS metric, federated training consistently achieved the lowest values across all client configurations\. For the five\-client configuration, federated learning achieved an IBS of 0\.138, compared to 0\.152 for local learning and 0\.245 for centralized learning\. As the number of clients decreased from five to three, variations in performance were observed across all paradigms\. In general, IBS values increased as the number of clients decreased, whereas C\-Index and AUC showed moderate fluctuations\.

Table 4: Mean RSF Performance Across Training Paradigms and Number of Clients (100-Tree Configuration)

Note: ∗ Values do not exhibit statistically significant differences with Centralized according to Dunn's post-hoc test ($p > 0.05$).

Fig\.[3](https://arxiv.org/html/2606.23871#S3.F3)shows the client\-level performance variability of the RSF algorithm under the five\-client configuration with 100 trees\. Differences across clients can be observed for all metrics and training paradigms\. Client C4 obtained the highest C\-Index and AUC values across all paradigms, while lower discrimination performance was observed for client C1\. In terms of IBS, higher values were obtained by client C1, especially under the centralized training method\. The error bars indicate variability across runs, which was more pronounced for client C1 across the discrimination metrics\. Lastly, federated training achieved higher discrimination performance than local training for several clients, particularly for clients C1 and C3 in terms of C\-Index and AUC\.

![Refer to caption](https://arxiv.org/html/2606.23871v1/x3.png)

Figure 3: Client-level performance variability across training paradigms for RSF under the five-client configuration (100 trees). Results are reported as mean ± standard deviation across 10 independent runs for each client: (a) C-Index, (b) AUC, and (c) IBS.
#### 3\.2\.4Cross\-Model Comparison under Federated Learning

The cross-model comparison focuses on FL, the main scenario of interest in this work. Comparing the models in this setting makes it possible to assess their effectiveness in a realistic environment, as well as their robustness to data heterogeneity and partitioning.

The models are compared for different numbers of clients (i.e., 3, 4, and 5). Fig. [4](https://arxiv.org/html/2606.23871#S3.F4) shows the per-client performance for 5 clients. This configuration was chosen because it has the highest number of participating clients, for which model performance tends to be higher. The results for 3 and 4 clients show similar characteristics, as presented in Table [5](https://arxiv.org/html/2606.23871#S3.T5).

As demonstrated in Fig\.[4](https://arxiv.org/html/2606.23871#S3.F4), performance variations between the models are observed across all metrics, with the black error bars indicating standard deviations for each setup\. In terms of the C\-Index, RSF and DeepSurv outperform CoxPH for all clients\. For clients C0 and C4, RSF has the highest C\-Index, while for client C2, DeepSurv has comparable performance\. For all clients, CoxPH has the lowest C\-Index\. Variability is higher for CoxPH for several clients, such as C1 and C3, while RSF and DeepSurv have stable performance\.

The AUC metric follows the same pattern. RSF and DeepSurv achieve better results than CoxPH, with RSF ahead for clients C0 and C4 and DeepSurv ahead for client C1. CoxPH yields the lowest AUC for most clients. CoxPH also exhibits higher standard deviations across clients, while RSF and DeepSurv are more consistent.

For the IBS metric, RSF achieves the lowest values for most clients, such as C1 and C3. DeepSurv obtains similar but slightly higher values. CoxPH, in contrast, yields higher IBS values, especially for client C1. The standard deviations are smaller than for the discrimination metrics, with CoxPH showing slightly higher variability.

![Refer to caption](https://arxiv.org/html/2606.23871v1/x4.png)

Figure 4: Client-level performance comparison of RSF, DeepSurv, and CoxPH under federated learning for the five-client configuration. Results are reported as mean ± standard deviation across 10 independent runs for each client: (a) C-Index, (b) AUC, and (c) IBS.

The average results reported in Table [5](https://arxiv.org/html/2606.23871#S3.T5) confirm these observations. Across all numbers of clients, RSF and DeepSurv obtain higher C-Index and AUC values than CoxPH, and the relative ordering of the models remains consistent. In terms of IBS, RSF achieves the lowest values for all configurations, followed by DeepSurv, while CoxPH consistently shows higher values.

Table 5: Average Federated Performance Across Models and Number of Clients

Note: Mean performance across clients across runs and client configurations.

### 3\.3Performance of Federated Optimization Strategies

The impact of different federated aggregation and optimization strategies on model performance was evaluated for the CoxPH and DeepSurv models using FedAvg, FedAdam, and FedProx\. In contrast, the RSF model follows a different federated optimization approach due to its tree\-based nature, relying on aggregation instead of gradient\-based optimization\. Therefore, the above\-mentioned optimization strategies were not applicable to RSF\.

Tables[6](https://arxiv.org/html/2606.23871#S3.T6)and[7](https://arxiv.org/html/2606.23871#S3.T7)present the average performance obtained using different federated aggregation and optimization strategies for the CoxPH and DeepSurv models, respectively\. From the results presented below, it can be observed that FedAvg and FedProx achieved comparable results across all client configurations, whereas FedAdam consistently yielded lower performance\.

Regarding the CoxPH model, FedAvg obtained the highest C\-Index and AUC values for the five\-client configuration, reaching 0\.611 and 0\.600, respectively\. FedProx showed a very similar behavior, with only marginal differences across all metrics and client configurations\. On the other hand, FedAdam consistently produced lower discrimination performance and slightly higher IBS values, especially for the three\-client configuration, where the IBS increased to 0\.217 compared to 0\.213 for both FedAvg and FedProx\. Taken together, the results suggest that FedAvg and FedProx provide more stable optimization behavior for CoxPH under heterogeneous client distributions in this experimental setting\.

A similar trend was observed for DeepSurv, where FedAvg and FedProx achieved comparable and consistently strong performance across all client configurations\. FedAvg obtained the highest C\-Index and AUC values for the five\-client configuration \(0\.727 and 0\.702, respectively\), whereas FedProx performed best in the three\-client configuration, reaching a C\-Index of 0\.748 and an AUC of 0\.676 and thereby outperforming both FedAvg and FedAdam\. Conversely, FedAdam again produced the lowest discrimination metrics and the highest IBS values, particularly for the three\-client setting\. These findings indicate that proximal regularization \(FedProx\) can be beneficial for deep learning\-based survival models under stronger client heterogeneity, while FedAvg remains a competitive and stable baseline; FedAdam, in turn, appears less robust in this experimental setting\.

Table 6: Average CoxPH Performance Across Federated Learning Strategies and Client Configurations

Note: Mean performance across clients across runs and client configurations.

Table 7: Average DeepSurv Performance Across Federated Learning Strategies and Client Configurations

Note: Mean performance across clients across runs and client configurations.

## 4Discussion

This section discusses the main findings extracted from the experimental evaluation of the proposed federated survival analysis framework\. First, a comparative analysis of the different training paradigms and survival models is presented\. Then, the impact of federated optimization strategies on model performance and robustness is examined\. Following this, the influence of data heterogeneity and client\-level variability is analyzed\. Lastly, practical guidelines for federated survival modeling are provided based on the observed experimental behavior\.

### 4\.1Comparative Analysis of Training Paradigms and Survival Models

A comparison between local, federated, and centralized training strategies shows that there are significant differences among the training techniques in terms of how the survival models learn from distributed data and generalize over clients. Centralized training generally provides strong discrimination performance, as the model is trained on the entire dataset, allowing it to capture global correlations and reduce statistical variability. This behavior is consistently observed for CoxPH and RSF, where centralized training achieves the highest or near-highest performance in terms of C-Index and AUC (see Tables [2](https://arxiv.org/html/2606.23871#S3.T2) and [4](https://arxiv.org/html/2606.23871#S3.T4)). However, this trend is not uniform across all model families. For DeepSurv, FL achieves comparable or even superior discrimination performance (see Table [3](https://arxiv.org/html/2606.23871#S3.T3)). This may be attributed to the learning characteristics of Deep Neural Networks (DNNs). In contrast to classical models, DeepSurv can benefit from the dynamics of federated training, where the aggregation of heterogeneous client updates introduces additional variance in the optimization process [[29](https://arxiv.org/html/2606.23871#bib.bib6)]. This stochasticity can have a regularizing effect and may help improve generalization, allowing federated models to outperform centralized training in certain cases.

On the other hand, FL provides a trade\-off between performance and privacy by enabling collaborative training without sharing raw data\. Although it cannot directly use the entire dataset, it leverages information obtained from different clients via parameter aggregation\. As shown in Tables[2](https://arxiv.org/html/2606.23871#S3.T2),[3](https://arxiv.org/html/2606.23871#S3.T3), and[4](https://arxiv.org/html/2606.23871#S3.T4), federated models generally outperform local models in terms of C\-Index and AUC, benefiting from exposure to a broader range of data distributions, as reflected in Table[11](https://arxiv.org/html/2606.23871#S6.T11)\. Nevertheless, the ability to perform well is dependent on the similarity between client distributions, since if the datasets are diverse, the federated model might fail to adequately account for the particularities of the individual clients\. Thus, performance will vary across clients, as depicted in Figures[1](https://arxiv.org/html/2606.23871#S3.F1),[2](https://arxiv.org/html/2606.23871#S3.F2), and[3](https://arxiv.org/html/2606.23871#S3.F3)\.

Local learning generally yields the lowest performance among the three training approaches\. This result was expected, since each model is trained using a limited amount of data and does not benefit from shared information across clients\. As shown in Table[10](https://arxiv.org/html/2606.23871#S6.T10), local models are more likely to overfit to the specific characteristics of the client and may fail to generalize effectively\. This problem is especially evident when clients have scarce data and skewed distributions\.

Furthermore, the comparative analysis of the different models demonstrates notable differences in their ability to handle heterogeneous datasets in federated environments\. In particular, RSF and DeepSurv achieved better discrimination performance than CoxPH across most evaluated configurations\. This behavior may be explained by the higher representational capacity of these models, which enables them to capture complex nonlinear relationships and interactions between covariates and survival outcomes\. In contrast, CoxPH relies on proportional hazards assumptions and linear risk modeling, which limits its flexibility when client data distributions vary substantially\.

Regarding calibration, RSF generally provided the most reliable survival probability estimates, followed by DeepSurv, whereas CoxPH tended to exhibit greater variability and instability\. This behavior highlights the robustness of tree\-based ensemble methods under heterogeneous data conditions\. Although DeepSurv demonstrated strong predictive capability, its calibration performance was more sensitive to optimization dynamics and data imbalance\. Similarly, the dependence of CoxPH on globally consistent risk sets may increase its sensitivity to distributional imbalances across clients\. In terms of robustness, RSF showed the most stable performance across clients and training paradigms, likely due to the variance\-reduction properties of ensemble learning methods\. DeepSurv also achieved strong results, although with slightly higher variability caused by its dependence on optimization and initialization\. In contrast, CoxPH appeared to be more sensitive to distributional shifts, resulting in greater variability across client configurations\.

Overall, these findings indicate that model selection plays a critical role in federated survival analysis\. RSF provided the best balance between discrimination, calibration, and robustness across distributed settings, whereas DeepSurv offered strong predictive capability with moderate sensitivity to optimization dynamics\. CoxPH remained advantageous primarily in more homogeneous and interpretability\-oriented scenarios\. These results highlight a clear trade\-off between data availability and the efficiency of the models being trained\. Centralized training remains the most effective solution when data centralization is possible, owing to its superior ability to learn global patterns\. FL provides a strong alternative under privacy constraints, improving over local training while maintaining data decentralization, although its effectiveness is influenced by client heterogeneity\. Local training, while privacy\-preserving, is limited by data scarcity and lack of generalization\.

### 4\.2Impact of Federated Optimization Strategies

The impact of the federated optimization strategies was assessed for the two gradient\-based models, CoxPH and DeepSurv, since RSF aggregates decision trees rather than gradient updates and is therefore incompatible with these strategies\. As summarized in Table[8](https://arxiv.org/html/2606.23871#S4.T8)and detailed in Tables[6](https://arxiv.org/html/2606.23871#S3.T6)and[7](https://arxiv.org/html/2606.23871#S3.T7), FedAvg and FedProx provided the most competitive and stable performance across all client configurations, whereas FedAdam consistently yielded inferior discrimination performance\.

For the CoxPH model, FedAvg and FedProx showed very similar behavior across all evaluated configurations, indicating that both strategies are appropriate choices for federated implementations of traditional statistical survival models. On the other hand, the lower performance observed with FedAdam suggests that adaptive server-side optimization did not provide clear advantages under the evaluated experimental conditions, which is consistent with the limited benefit that adaptive methods typically offer for the convex, low-dimensional optimization underlying the Cox partial likelihood.

For DeepSurv, FedProx and FedAvg were closely matched and clearly ahead of FedAdam\. FedAvg obtained the best discrimination performance in the five\-client configuration, whereas FedProx achieved the highest overall scores in the three\-client configuration \(a C\-Index of 0\.748 and an AUC of 0\.676\)\. The proximal term used by FedProx is designed to limit client drift when local updates diverge under heterogeneous data, consistent with its characterization in Table[8](https://arxiv.org/html/2606.23871#S4.T8); in this study, however, its advantage over a well\-tuned FedAvg remained marginal\.

Altogether, these findings suggest that simpler aggregation strategies such as FedAvg can already achieve competitive performance for federated survival analysis, while FedProx can provide additional robustness for more complex neural\-network\-based models when client distributions diverge\. In contrast, FedAdam did not demonstrate clear advantages in the evaluated experimental scenarios and would likely require careful hyperparameter tuning to become competitive\.
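To make the differences between the three strategies concrete, the following sketch shows the core update of each on flat parameter vectors. It is an illustration under stated assumptions, not the implementation used in this study: the function names, the sample-size weighting, and the hyperparameter defaults are hypothetical, and FedAdam is written in its commonly described form in which the server applies an Adam-style step to the averaged client update.

```python
import math


def fedavg(client_weights, client_sizes):
    """FedAvg: sample-size-weighted average of client parameter vectors."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(n * w[d] for w, n in zip(client_weights, client_sizes)) / total
            for d in range(dim)]


def fedprox_gradient(local_grad, w_local, w_global, mu):
    """FedProx: local gradient plus the gradient of (mu/2)*||w - w_global||^2."""
    return [g + mu * (wl - wg)
            for g, wl, wg in zip(local_grad, w_local, w_global)]


def fedadam_step(w_global, client_weights, client_sizes, m, v,
                 lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """FedAdam: Adam-style server step on the averaged client update."""
    avg = fedavg(client_weights, client_sizes)
    delta = [a - w for a, w in zip(avg, w_global)]  # pseudo-gradient
    m = [beta1 * mi + (1 - beta1) * d for mi, d in zip(m, delta)]
    v = [beta2 * vi + (1 - beta2) * d * d for vi, d in zip(v, delta)]
    w_new = [w + lr * mi / (math.sqrt(vi) + tau)
             for w, mi, vi in zip(w_global, m, v)]
    return w_new, m, v


# Hypothetical two-client round with two parameters.
print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
print(fedprox_gradient([0.5], [1.0], [0.0], mu=0.1))
print(fedadam_step([0.0], [[1.0]], [1], m=[0.0], v=[0.0]))
```

The sketch also shows why FedAdam has more to tune: the server learning rate, both moment decay rates, and the adaptivity constant `tau` all affect the step size, whereas FedAvg has no server-side hyperparameters and FedProx adds only `mu`.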

Table 8: Summary of the Evaluated Federated Optimization Strategies
### 4\.3Data Heterogeneity

The effects of heterogeneity in the data are studied by varying the number of participating clients, thereby effectively limiting the set of distributions considered for the training procedure\. In this case, each client keeps its original distribution, while the reduction in the number of clients implies the exclusion of certain data distributions from consideration rather than redistributing data across clients\. An interesting result is that the performance of the model does not decrease monotonically with a decrease in the number of clients\. This can be observed across all models, as shown in Tables[2](https://arxiv.org/html/2606.23871#S3.T2),[3](https://arxiv.org/html/2606.23871#S3.T3), and[4](https://arxiv.org/html/2606.23871#S3.T4), where no consistent improvement or degradation is observed when moving from 5 to 3 clients\. This indicates that the performance of the model does not depend on the number of clients themselves but rather on the type of data distribution used for training the model\. With an increase in the number of clients, a greater variety of data in terms of feature distribution, survival pattern, and censoring is taken into account\.

This becomes especially apparent in calibration\. As fewer clients participate in training, calibration degrades for all training modes, as reflected by the generally increasing IBS values in Tables [2](https://arxiv.org/html/2606.23871#S3.T2), [3](https://arxiv.org/html/2606.23871#S3.T3), and [4](https://arxiv.org/html/2606.23871#S3.T4)\. This suggests that the estimation of survival probabilities is sensitive to the diversity of observed survival times and event patterns\. When certain clients are excluded, the model may fail to capture important regions of the time\-to\-event distribution, which can lead to less reliable probability estimates\. Discrimination, by contrast, remains fairly stable as the set of participating clients changes\. As shown in Tables [2](https://arxiv.org/html/2606.23871#S3.T2), [3](https://arxiv.org/html/2606.23871#S3.T3), and [4](https://arxiv.org/html/2606.23871#S3.T4), C\-Index and AUC values exhibit only minor fluctuations when the number of clients changes\. A likely reason is that ranking patients by risk is less sensitive to the presence or absence of specific data distributions: ranking depends only on the relative ordering of individual risks, which can still be recovered when the diversity of the data is reduced, whereas calibration depends on capturing the full underlying distribution\.
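The contrast between ranking and calibration can be illustrated with a small sketch on hypothetical toy data\. It computes Harrell's C\-Index and a simplified Brier score at a single time horizon for two sets of predicted survival probabilities that share the same ordering but differ in their absolute values\. The Brier score here assumes no censoring and omits both the inverse\-probability\-of\-censoring weights and the integration over time that the IBS uses, so it is only an illustration of the idea, not the metric reported in the tables\.

```python
def concordance_index(times, events, risks):
    """Harrell's C-Index: share of comparable pairs ranked correctly.

    A pair (i, j) is comparable if i had an event and times[i] < times[j].
    It is concordant if the earlier event received the higher risk.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


def brier_at(horizon, times, surv_probs):
    """Uncensored Brier score at `horizon`: mean squared gap between the
    observed status 1{T > horizon} and the predicted S(horizon)."""
    alive = [1.0 if t > horizon else 0.0 for t in times]
    return sum((a - s) ** 2 for a, s in zip(alive, surv_probs)) / len(times)


# Hypothetical data: four patients, all with observed events.
times = [2, 4, 6, 8]
events = [1, 1, 1, 1]
well = [0.1, 0.3, 0.7, 0.9]        # predicted S(5), close to the outcomes
shifted = [0.5, 0.6, 0.8, 0.95]    # same ordering, overly optimistic

for probs in (well, shifted):
    risks = [1.0 - s for s in probs]
    print(concordance_index(times, events, risks), brier_at(5, times, probs))
```

Both prediction sets yield the same C\-Index because the ordering of risks is unchanged, while the Brier score is worse for the shifted predictions\.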

In addition, the variation across clients reflects differences in their data distributions\. As shown in Table [11](https://arxiv.org/html/2606.23871#S6.T11), some clients consistently achieve better performance, whereas others display poorer performance and greater variability\. A notable exception is observed for client C1, where centralized training shows significantly higher IBS values \(Table [12](https://arxiv.org/html/2606.23871#S6.T12)\)\. A likely explanation is that client C1 contains some of the longest observed event times: it includes very late events, beyond 4000 days, that do not occur in the test data\. Because centralized learning aggregates information from all clients, the model learns to assign nonzero risk at much later time points than the other clients' data would support, leading to poorer calibration in this case\. A similar behavior is also observed for the CoxPH and DeepSurv models \(see Sections [3\.2\.1](https://arxiv.org/html/2606.23871#S3.SS2.SSS1) and [3\.2\.2](https://arxiv.org/html/2606.23871#S3.SS2.SSS2)\), indicating that the effect is not specific to the RSF model but is related to the underlying data distribution\. This suggests that characteristics such as sample size, censoring ratio, and feature distribution vary among the clients\. In such scenarios, excluding specific clients may significantly affect the global model, particularly if they contain unique or extreme patterns\.

Moreover, the observed differences between clients suggest that specific properties of the data, such as censoring levels, event scarcity, and institutional variance, play a critical role in model performance\. Clients with few observed events or high censoring ratios contribute less information to the learning process, which can result in poor discrimination and calibration\. Similarly, institutional imbalance due to the unequal distribution of data among clients may result in a biased global model\. In summary, the influence of data heterogeneity appears to depend more on the diversity of the participating clients, and on how well they represent different types of data distributions, than on their number\.
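The client properties discussed in this subsection can be inspected before training\. The sketch below is an illustrative assumption rather than part of the study's pipeline: it summarizes a client's sample size, number of observed events, censoring ratio, and latest event time from lists of follow\-up times and event indicators\. The function name and the toy values are hypothetical\.

```python
def summarize_client(times, events):
    """Summarize one client's survival data.

    times  : follow-up time per patient (e.g. in days)
    events : 1 if the event was observed, 0 if the patient was censored
    """
    n = len(times)
    n_events = sum(1 for e in events if e)
    event_times = [t for t, e in zip(times, events) if e]
    return {
        "n": n,
        "n_events": n_events,
        "censoring_ratio": 1.0 - n_events / n,
        # Very late events can matter for calibration, as noted for C1.
        "max_event_time": max(event_times) if event_times else None,
    }


# Hypothetical client with one very late event and two censored patients.
print(summarize_client([100, 4200, 300, 50], [1, 1, 0, 0]))
```

Comparing these summaries across clients gives a rough view of which clients carry few events, heavy censoring, or unusually late event times\.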

### 4\.4 Practical Guidelines

Based on the experimental results, practical recommendations can be drawn on how to choose a model and a training paradigm for federated survival analysis\. These recommendations are summarized in Table [9](https://arxiv.org/html/2606.23871#S4.T9) and discussed below, with each guideline grounded in the behavior observed under the different data and training conditions\.

Table 9: Practical Guidelines for Federated Survival Modeling

##### Heterogeneous clients

When client distributions differ substantially, RSF with federated training is recommended\. This is supported by the results in Section [3\.2\.3](https://arxiv.org/html/2606.23871#S3.SS2.SSS3) and the cross\-model comparison \(Section [3\.2\.4](https://arxiv.org/html/2606.23871#S3.SS2.SSS4)\), where RSF consistently shows stable performance across clients and lower variability compared to CoxPH and DeepSurv\. In particular, RSF maintains competitive discrimination while showing robustness to distribution shifts, as demonstrated by the relatively small performance gaps between federated and centralized training \(Table [4](https://arxiv.org/html/2606.23871#S3.T4)\) and its stability across clients in Figure [3](https://arxiv.org/html/2606.23871#S3.F3)\.

##### Homogeneous data

For homogeneous datasets, CoxPH is a suitable choice, since its underlying assumptions are more likely to hold\. As discussed in Section [3\.2\.4](https://arxiv.org/html/2606.23871#S3.SS2.SSS4), CoxPH performs well when data distributions are consistent, but its performance degrades under heterogeneity due to its linear structure and reliance on proportional hazards\. This is also reflected in the higher variability observed across clients in the CoxPH blocks of Tables [10](https://arxiv.org/html/2606.23871#S6.T10), [11](https://arxiv.org/html/2606.23871#S6.T11), and [12](https://arxiv.org/html/2606.23871#S6.T12)\.

##### Scarce or imbalanced data

When data are scarce or imbalanced across clients, FL with RSF or DeepSurv is preferable\. The experiments in Sections [3\.2\.1](https://arxiv.org/html/2606.23871#S3.SS2.SSS1) and [3\.2\.2](https://arxiv.org/html/2606.23871#S3.SS2.SSS2) show that federated training improves performance over local models, particularly for weaker clients such as C1, where noticeable gains in C\-Index and AUC are observed \(Figures [1](https://arxiv.org/html/2606.23871#S3.F1) and [2](https://arxiv.org/html/2606.23871#S3.F2)\)\. This indicates that aggregation helps mitigate data scarcity by leveraging information from other clients\.

##### Calibration\-critical applications

When calibration is critical, RSF with federated training should be chosen\. According to Table [5](https://arxiv.org/html/2606.23871#S3.T5) in Section [3\.2\.4](https://arxiv.org/html/2606.23871#S3.SS2.SSS4), RSF has the lowest IBS among all options considered, indicating more accurate survival probability estimates than DeepSurv and CoxPH\.

##### Interpretability\-focused studies

When interpretability matters, CoxPH remains the preferred model\. As discussed in Section [4\.1](https://arxiv.org/html/2606.23871#S4.SS1), it provides interpretable coefficients and hazard ratios, making it well suited for clinical settings despite its lower predictive performance\.

##### Strict privacy constraints

When strict privacy requirements must be met, FL with RSF or DeepSurv is recommended\. Across the experiments, federated models outperform local ones and approximate centralized performance \(Sections [3\.2\.2](https://arxiv.org/html/2606.23871#S3.SS2.SSS2) and [3\.2\.3](https://arxiv.org/html/2606.23871#S3.SS2.SSS3)\), making them a practical solution when data sharing is not possible\.

##### Limited computational resources

Finally, in environments with limited computational resources, CoxPH is a practical option due to its lower computational complexity\. As a linear model, it requires less training time and fewer resources than RSF and DeepSurv, while still providing acceptable performance in simpler scenarios\.

Nonetheless, it is important to note that this discussion is specific to the dataset used in this study, and the comparative effectiveness of the models may differ under other data characteristics, such as variable distributions, censoring effects, and client heterogeneity, as well as application constraints like interpretability, privacy, and computational cost\.
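As a compact restatement of the guidelines in this subsection, the following sketch encodes them as a lookup table\. It is an illustrative summary of the text above, not a reproduction of Table 9: the key names, the `Recommendation` type, and the `recommend` helper are hypothetical, and the paradigm is left as `None` wherever the text names a model without naming a training paradigm\.

```python
from typing import NamedTuple, Optional, Tuple


class Recommendation(NamedTuple):
    models: Tuple[str, ...]
    paradigm: Optional[str]  # None where the text does not name a paradigm


# Keys are hypothetical labels for the scenarios discussed in this subsection.
GUIDELINES = {
    "heterogeneous_clients": Recommendation(("RSF",), "federated"),
    "homogeneous_data": Recommendation(("CoxPH",), None),
    "scarce_or_imbalanced_data": Recommendation(("RSF", "DeepSurv"), "federated"),
    "calibration_critical": Recommendation(("RSF",), "federated"),
    "interpretability": Recommendation(("CoxPH",), None),
    "strict_privacy": Recommendation(("RSF", "DeepSurv"), "federated"),
    "limited_compute": Recommendation(("CoxPH",), None),
}


def recommend(constraint: str) -> Recommendation:
    """Return the recommendation for a scenario label, or raise ValueError."""
    try:
        return GUIDELINES[constraint]
    except KeyError:
        raise ValueError(f"unknown constraint: {constraint!r}") from None


print(recommend("calibration_critical"))
```

When several constraints apply at once, the recommendations can conflict \(for example, interpretability versus calibration\), so such a lookup is a starting point rather than a decision rule\.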

## 5 Conclusion

This paper presented a systematic, multi\-model evaluation of federated survival analysis on the Fed\-TCGA\-BRCA dataset, a cross\-institutional breast cancer cohort with naturally heterogeneous clients\. Federated training consistently outperformed local training and approached centralized performance without sharing raw data, occasionally exceeding it for DeepSurv, where aggregation of heterogeneous updates appears to act as a regularizer\. Among the models, RSF offered the best overall balance of discrimination, calibration, and robustness across clients, whereas CoxPH remained competitive mainly in homogeneous and interpretability\-oriented settings\. Performance was governed by the diversity of client distributions rather than their number, and FedAvg and FedProx proved more stable than FedAdam in this setting\. Building on these observations, the study’s main practical contribution is a set of decision\-oriented guidelines that map data, privacy, interpretability, and resource constraints to suitable model and paradigm choices for federated survival modeling in privacy\-constrained healthcare environments\.

Several directions remain open\. Larger federations with more data could improve stability and calibration, which was most sensitive to client exclusion, while extreme or unbalanced distributions motivate data\-balancing strategies and more robust estimators\. Personalized or hybrid federated schemes may better accommodate strong heterogeneity, and a closer analysis of communication efficiency would clarify scalability to larger deployments\. Finally, integrating formal privacy\-preserving mechanisms such as differential privacy, and improving the interpretability of RSF and DeepSurv, would further strengthen the clinical adoption of federated survival models\.

## 6 Client\-Level Performance Results

This appendix reports the detailed per\-client performance results that complement the aggregated values discussed in the main text\. All values are reported as mean ± standard deviation over 10 independent runs\. Table [10](https://arxiv.org/html/2606.23871#S6.T10) reports the per\-client performance under local training, which is identical regardless of the number of participating clients since each client is trained independently; Table [11](https://arxiv.org/html/2606.23871#S6.T11) reports the per\-client performance under federated training; and Table [12](https://arxiv.org/html/2606.23871#S6.T12) reports the per\-client performance under centralized training\.

Table 10: Local Per\-Client Performance\. Note: Reported values correspond to mean ± standard deviation over 10 runs\.

Table 11: Federated Per\-Client Performance Across Different Numbers of Clients\. Note: Reported values correspond to mean ± standard deviation over 10 runs\.

Table 12: Centralized Per\-Client Performance Across Different Numbers of Clients\. Note: Reported values correspond to mean ± standard deviation over 10 runs\.

## References

- \[1\]M\. Andreux, A\. Manoel, R\. Menuet, C\. Saillard, and C\. Simpson\(2020\)Federated survival analysis with discrete\-time cox models\.arXiv preprint arXiv:2006\.08997\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p5.1),[§2\.2\.1](https://arxiv.org/html/2606.23871#S2.SS2.SSS1.p1.3)\.
- \[2\]A\. Archetti, F\. Ieva, and M\. Matteucci\(2023\)Scaling survival analysis in healthcare with federated survival forests: a comparative study on heart failure and breast cancer genomics\.Future Generation Computer Systems149,pp\. 343–358\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p5.1),[§2\.2\.3](https://arxiv.org/html/2606.23871#S2.SS2.SSS3.p1.5)\.
- \[3\]D\. J\. Beutel, T\. Topal, A\. Mathur, X\. Qiu, J\. Fernandez\-Marques, Y\. Gao, L\. Sani, K\. H\. Li, T\. Parcollet, P\. P\. B\. de Gusmão,et al\.\(2020\)Flower: a friendly federated learning research framework\.arXiv preprint arXiv:2007\.14390\.Cited by:[§2\.4](https://arxiv.org/html/2606.23871#S2.SS4.SSS0.Px3.p2.1)\.
- \[4\]L\. Breiman\(2001\)Random forests\.Machine learning45\(1\),pp\. 5–32\.Cited by:[§2\.2\.3](https://arxiv.org/html/2606.23871#S2.SS2.SSS3.p1.5)\.
- \[5\]I\. Y\. Chen, E\. Pierson, S\. Rose, S\. Joshi, K\. Ferryman, and M\. Ghassemi\(2021\)Ethical machine learning in healthcare\.Annual review of biomedical data science4\(1\),pp\. 123–144\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p2.1)\.
- \[6\]Y\. Chen, W\. Lu, X\. Qin, J\. Wang, and X\. Xie\(2023\)Metafed: federated learning among federations with cyclic knowledge distillation for personalized healthcare\.IEEE Transactions on Neural Networks and Learning Systems35\(11\),pp\. 16671–16682\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p5.1)\.
- \[7\]T\. G\. Clark, M\. J\. Bradburn, S\. B\. Love, and D\. G\. Altman\(2003\)Survival analysis part i: basic concepts and first analyses\.British journal of cancer89\(2\),pp\. 232–238\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p4.1)\.
- \[8\]D\. R\. Cox\(1972\)Regression models and life\-tables\.Journal of the royal statistical society: Series B \(methodological\)34\(2\),pp\. 187–202\.Cited by:[§2\.2\.1](https://arxiv.org/html/2606.23871#S2.SS2.SSS1.p1.1)\.
- \[9\]I\. Dayan, H\. R\. Roth, A\. Zhong, A\. Harouni, A\. Gentili, A\. Z\. Abidin, A\. Liu, A\. B\. Costa, B\. J\. Wood, C\. Tsai,et al\.\(2021\)Federated learning for predicting clinical outcomes in patients with covid\-19\.Nature medicine27\(10\),pp\. 1735–1743\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p3.1)\.
- \[10\]European Commission\(2016\)Regulation \(EU\) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC \(General Data Protection Regulation\)\.European Commission\.External Links:[Link](https://gdpr-info.eu/)Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p1.2)\.
- \[11\]M\. Fernandez\-de\-Retana, U\. Zulaika, R\. Sánchez\-Corcuera, and A\. Almeida\(2025\)Differential privacy: gradient leakage attacks in federated learning environments\.arXiv preprint arXiv:2510\.23931\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p3.1)\.
- \[12\]J\. Geiping, H\. Bauermeister, H\. Dröge, and M\. Moeller\(2020\)Inverting gradients\-how easy is it to break privacy in federated learning?\.Advances in neural information processing systems33,pp\. 16937–16947\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p3.1)\.
- \[13\]E\. Graf, C\. Schmoor, W\. Sauerbrei, and M\. Schumacher\(1999\)Assessment and comparison of prognostic classification schemes for survival data\.Statistics in Medicine18\(17\-18\),pp\. 2529–2545\.External Links:[Document](https://dx.doi.org/10.1002/%28sici%291097-0258%2819990915/30%2918%3A17/18%3C2529%3A%3Aaid-sim274%3E3.0.co%3B2-5)Cited by:[§3\.1\.3](https://arxiv.org/html/2606.23871#S3.SS1.SSS3.p1.1)\.
- \[14\]D\. Gupta, O\. Kayode, S\. Bhatt, M\. Gupta, and A\. S\. Tosun\(2021\)Hierarchical federated learning based anomaly detection using digital twins for smart healthcare\.In2021 IEEE 7th international conference on collaboration and internet computing \(CIC\),pp\. 16–25\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p5.1)\.
- \[15\]F\. E\. Harrell Jr\., R\. M\. Califf, D\. B\. Pryor, K\. L\. Lee, and R\. A\. Rosati\(1982\)Evaluating the Yield of Medical Tests\.JAMA247\(18\),pp\. 2543–2546\.External Links:[Document](https://dx.doi.org/10.1001/jama.1982.03320430047030)Cited by:[§3\.1\.1](https://arxiv.org/html/2606.23871#S3.SS1.SSS1.p1.9)\.
- \[16\]P\. J\. Heagerty, T\. Lumley, and M\. S\. Pepe\(2000\)Time\-dependent ROC curves for censored survival data and a diagnostic marker\.Biometrics56\(2\),pp\. 337–344\.External Links:[Document](https://dx.doi.org/10.1111/j.0006-341X.2000.00337.x)Cited by:[§3\.1\.2](https://arxiv.org/html/2606.23871#S3.SS1.SSS2.p1.1)\.
- \[17\]A\. P\. Heath, V\. Ferretti, S\. Agrawal, M\. An, J\. C\. Angelakos, R\. Arya, R\. Bajari, B\. Baqar, J\. H\. Barnowski, J\. Burt,et al\.\(2021\)The nci genomic data commons\.Nature genetics53\(3\),pp\. 257–262\.Cited by:[§2\.1](https://arxiv.org/html/2606.23871#S2.SS1.p1.1)\.
- \[18\]H\. Ishwaran, U\.B\. Kogalur, E\.H\. Blackstone, and M\.S\. Lauer\(2008\)Random survival forests\.Ann\. Appl\. Statist\.2\(3\),pp\. 841–860\.External Links:[Link](https://arxiv.org/abs/0811.1645v1)Cited by:[§2\.2\.3](https://arxiv.org/html/2606.23871#S2.SS2.SSS3.p1.5)\.
- \[19\]P\. Kairouz and H\. B\. McMahan\(2021\)Advances and open problems in federated learning\.Foundations and trends in machine learning14\(1\-2\),pp\. 1–210\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p3.1)\.
- \[20\]J\. L\. Katzman, U\. Shaham, A\. Cloninger, J\. Bates, T\. Jiang, and Y\. Kluger\(2018\)DeepSurv: personalized treatment recommender system using a cox proportional hazards deep neural network\.BMC medical research methodology18\(1\),pp\. 24\.Cited by:[§2\.2\.2](https://arxiv.org/html/2606.23871#S2.SS2.SSS2.p1.2)\.
- \[21\]C\. J\. Kelly, A\. Karthikesalingam, M\. Suleyman, G\. Corrado, and D\. King\(2019\)Key challenges for delivering clinical impact with artificial intelligence\.BMC medicine17\(1\),pp\. 195\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p1.2),[§1](https://arxiv.org/html/2606.23871#S1.p2.1)\.
- \[22\]S\. Kim, K\. Kim, J\. Choe, I\. Lee, and J\. Kang\(2020\)Improved survival analysis by learning shared genomic information from pan\-cancer data\.Bioinformatics36\(Supplement\_1\),pp\. i389–i398\.Cited by:[§2\.2\.2](https://arxiv.org/html/2606.23871#S2.SS2.SSS2.p1.2)\.
- \[23\]D\. P\. Kingma and J\. Ba\(2014\)Adam: a method for stochastic optimization\.arXiv preprint arXiv:1412\.6980\.Cited by:[§2\.4](https://arxiv.org/html/2606.23871#S2.SS4.SSS0.Px3.p1.3)\.
- \[24\]D\. G\. Kleinbaum and M\. Klein\(1996\)Survival analysis a self\-learning text\.Springer\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p4.1),[§2\.2\.1](https://arxiv.org/html/2606.23871#S2.SS2.SSS1.p1.3)\.
- \[25\]C\. Lee, W\. Zame, J\. Yoon, and M\. Van Der Schaar\(2018\)Deephit: a deep learning approach to survival analysis with competing risks\.InProceedings of the AAAI conference on artificial intelligence,Vol\.32\.Cited by:[§2\.2\.2](https://arxiv.org/html/2606.23871#S2.SS2.SSS2.p1.2)\.
- \[26\]S\. Li, Z\. Wang, Y\. Shang, Q\. Wu, C\. Hong, Y\. Ning, D\. Miao, M\. E\. H\. Ong, B\. Chakraborty, and N\. Liu\(2025\)Developing federated time\-to\-event scores using heterogeneous real\-world survival data\.Computers in Biology and Medicine197,pp\. 111084\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p5.1)\.
- \[27\]T\. Li, A\. K\. Sahu, M\. Zaheer, M\. Sanjabi, A\. Talwalkar, and V\. Smith\(2020\)Federated optimization in heterogeneous networks\.Proceedings of Machine learning and systems2,pp\. 429–450\.Cited by:[§2\.4](https://arxiv.org/html/2606.23871#S2.SS4.SSS0.Px2.p1.3),[Table 8](https://arxiv.org/html/2606.23871#S4.T8.1.3.2.1)\.
- \[28\]W\. Li, F\. Milletarì, D\. Xu, N\. Rieke, J\. Hancox, W\. Zhu, M\. Baust, Y\. Cheng, S\. Ourselin, M\. J\. Cardoso,et al\.\(2019\)Privacy\-preserving federated brain tumour segmentation\.InInternational workshop on machine learning in medical imaging,pp\. 133–141\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p3.1)\.
- \[29\]X\. Li, K\. Huang, W\. Yang, S\. Wang, and Z\. Zhang\(2020\)On the Convergence of FedAvg on Non\-IID Data\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p3.1),[§4\.1](https://arxiv.org/html/2606.23871#S4.SS1.p1.1)\.
- \[30\]T\. Liao, T\. Su, Y\. Lu, L\. Huang, W\. Wei, and L\. Feng\(2024\)Random survival forest algorithm for risk stratification and survival prediction in gastric neuroendocrine neoplasms\.Scientific Reports14\(1\),pp\. 26969\.Cited by:[§2\.2\.3](https://arxiv.org/html/2606.23871#S2.SS2.SSS3.p1.5)\.
- \[31\]N\. Maslej, L\. Fattorini, R\. Perrault, Y\. Gil, V\. Parli, N\. Kariuki, E\. Capstick, A\. Reuel, E\. Brynjolfsson, J\. Etchemendy,et al\.\(2025\)Artificial intelligence index report 2025\.arXiv preprint arXiv:2504\.07139\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p1.2)\.
- \[32\]B\. McMahan, E\. Moore, D\. Ramage, S\. Hampson, and B\. A\. y Arcas\(2017\)Communication\-efficient learning of deep networks from decentralized data\.InArtificial intelligence and statistics,pp\. 1273–1282\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p3.1),[§2\.4](https://arxiv.org/html/2606.23871#S2.SS4.SSS0.Px1.p1.3),[Table 8](https://arxiv.org/html/2606.23871#S4.T8.1.2.1.1)\.
- \[33\]L\. Mondrejevski, I\. Miliou, A\. Montanino, D\. Pitts, J\. Hollmén, and P\. Papapetrou\(2022\)FLICU: a federated learning workflow for intensive care unit mortality prediction\.In2022 IEEE 35th International Symposium on Computer\-Based Medical Systems \(CBMS\),pp\. 32–37\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p5.1)\.
- \[34\]U\.S\. Department of Health and Human Services\(1996\)The Health Insurance Portability and Accountability Act of 1996 \(HIPAA\)\.U\.S\. Department of Health and Human Services\.External Links:[Link](https://www.hhs.gov/hipaa/index.html)Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p1.2)\.
- \[35\]J\. Ogier du Terrail, S\. Ayed, E\. Cyffers, F\. Grimberg, C\. He, R\. Loeb, P\. Mangold, T\. Marchand, O\. Marfoq, E\. Mushtaq,et al\.\(2022\)Flamby: datasets and benchmarks for cross\-silo federated learning in realistic healthcare settings\.Advances in Neural Information Processing Systems35,pp\. 5315–5334\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p7.1),[§2\.1](https://arxiv.org/html/2606.23871#S2.SS1.p1.1),[§2\.5](https://arxiv.org/html/2606.23871#S2.SS5.p2.5)\.
- \[36\]A\. Qayyum, K\. Ahmad, M\. A\. Ahsan, A\. Al\-Fuqaha, and J\. Qadir\(2022\)Collaborative federated learning for healthcare: multi\-modal covid\-19 diagnosis at the edge\.IEEE Open Journal of the Computer Society3,pp\. 172–184\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p5.1)\.
- \[37\]M\. M\. Rahman and S\. Purushotham\(2022\)Fedpseudo: pseudo value\-based deep learning models for federated survival analysis\.arXiv preprint arXiv:2207\.05247\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p5.1)\.
- \[38\]S\. Reddi, Z\. Charles, M\. Zaheer, Z\. Garrett, K\. Rush, J\. Konečný, S\. Kumar, and H\. B\. McMahan\(2020\)Adaptive federated optimization\.arXiv preprint arXiv:2003\.00295\.Cited by:[§2\.4](https://arxiv.org/html/2606.23871#S2.SS4.SSS0.Px3.p1.1),[Table 8](https://arxiv.org/html/2606.23871#S4.T8.1.4.3.1)\.
- \[39\]N\. Rieke, J\. Hancox, W\. Li, F\. Milletari, H\. R\. Roth, S\. Albarqouni, S\. Bakas, M\. N\. Galtier, B\. A\. Landman, K\. Maier\-Hein,et al\.\(2020\)The future of digital health with federated learning\.NPJ digital medicine3\(1\),pp\. 119\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p1.2),[§1](https://arxiv.org/html/2606.23871#S1.p3.1)\.
- \[40\]H\. Sung, J\. Ferlay, R\. L\. Siegel, M\. Laversanne, I\. Soerjomataram, A\. Jemal, and F\. Bray\(2021\)Global cancer statistics 2020: globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries\.CA: a cancer journal for clinicians71\(3\),pp\. 209–249\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p4.1)\.
- \[41\]E\. J\. Topol\(2019\)High\-performance medicine: the convergence of human and artificial intelligence\.Nature medicine25,pp\. 44–56\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p1.2)\.
- \[42\]J\. N\. Weinstein, E\. A\. Collisson, G\. B\. Mills, K\. R\. Shaw, B\. A\. Ozenberger, K\. Ellrott, I\. Shmulevich, C\. Sander, and J\. M\. Stuart\(2013\)The cancer genome atlas pan\-cancer analysis project\.Nature genetics45\(10\),pp\. 1113–1120\.Cited by:[§2\.1](https://arxiv.org/html/2606.23871#S2.SS1.p1.1)\.
- \[43\]L\. Zhu, Z\. Liu, and S\. Han\(2019\)Deep leakage from gradients\.Advances in neural information processing systems32\.Cited by:[§1](https://arxiv.org/html/2606.23871#S1.p3.1)\.
