SurvivalPFN: Amortizing Survival Prediction via In-Context Bayesian Inference
Summary
SurvivalPFN is a prior-data fitted network that amortizes Bayesian inference for survival analysis via in-context learning, achieving strong predictive performance across 61 datasets without task-specific training or hyperparameter tuning.
Cached at: 05/18/26, 06:41 AM
# SurvivalPFN: Amortizing Survival Prediction via In-Context Bayesian Inference
Source: [https://arxiv.org/html/2605.15488](https://arxiv.org/html/2605.15488)
Shi-ang Qi¹, Vahid Balazadeh¹,², Michael Cooper¹,², Russell Greiner³,⁴, Rahul G. Krishnan¹,²

¹Vector Institute, ²University of Toronto, ³University of Alberta, ⁴Alberta Machine Intelligence Institute
###### Abstract
Survival analysis provides a powerful statistical framework for modeling time\-to\-event outcomes in the presence of censoring\. However, selecting an appropriate estimator from the many specialized survival approaches often requires substantial methodological and domain expertise\. We introduce SurvivalPFN, a prior\-data fitted network that amortizes Bayesian inference for censored observations through in\-context learning\. SurvivalPFN is pretrained on a diverse family of synthetic, identifiable, and right\-censored data\-generating processes, enabling it to amortize survival analysis in a single forward pass during inference\. As a result, the model adapts to the effective complexity of each dataset without task\-specific training or hyperparameter tuning, avoids restrictive parametric assumptions, and produces calibrated survival distributions\. In a large\-scale benchmark spanning 61 datasets, 21 methods, and 5 evaluation metrics, SurvivalPFN achieves strong predictive performance and often improves upon established survival models\. These results suggest that SurvivalPFN offers a principled and practical foundation model for survival analysis, with potential applications in high\-impact domains such as healthcare, finance, and engineering \([https://github\.com/rgklab/SurvivalPFN](https://github.com/rgklab/SurvivalPFN)\)\.
## 1 Introduction
Figure 1: Computational efficiency vs. performance across 61 datasets and 5 metrics. SurvivalPFN achieves the best median rank while matching classical models in speed.

Survival analysis models the distribution of time to an event of interest, with applications spanning medicine \[[56](https://arxiv.org/html/2605.15488#bib.bib2),[92](https://arxiv.org/html/2605.15488#bib.bib3),[81](https://arxiv.org/html/2605.15488#bib.bib153),[10](https://arxiv.org/html/2605.15488#bib.bib6),[9](https://arxiv.org/html/2605.15488#bib.bib95),[20](https://arxiv.org/html/2605.15488#bib.bib94)\], e-commerce \[[59](https://arxiv.org/html/2605.15488#bib.bib14),[74](https://arxiv.org/html/2605.15488#bib.bib15),[16](https://arxiv.org/html/2605.15488#bib.bib22)\], engineering \[[72](https://arxiv.org/html/2605.15488#bib.bib16),[6](https://arxiv.org/html/2605.15488#bib.bib20),[51](https://arxiv.org/html/2605.15488#bib.bib23)\], and finance \[[60](https://arxiv.org/html/2605.15488#bib.bib25),[23](https://arxiv.org/html/2605.15488#bib.bib24),[25](https://arxiv.org/html/2605.15488#bib.bib26)\]. Such models are learned and evaluated on data that often exhibits right-censoring: for some instances, the event is not observed during the follow-up period, so we only know that the event time exceeds the censoring time.
Various survival analysis methods have been proposed to handle right-censored data, but each imposes different inductive biases. Classical models such as Cox proportional hazards (CoxPH) \[[11](https://arxiv.org/html/2605.15488#bib.bib38)\] often rely on constant hazard ratios and linear covariate effects. Ensemble methods and deep survival models improve flexibility, but typically require careful tuning and often retain structural assumptions through parametric forms \[[85](https://arxiv.org/html/2605.15488#bib.bib65),[71](https://arxiv.org/html/2605.15488#bib.bib129)\], proportional hazards \[[45](https://arxiv.org/html/2605.15488#bib.bib29)\], fixed time/quantile discretizations \[[103](https://arxiv.org/html/2605.15488#bib.bib82),[55](https://arxiv.org/html/2605.15488#bib.bib30),[73](https://arxiv.org/html/2605.15488#bib.bib127)\], or mixture-based continuous-time distributions \[[68](https://arxiv.org/html/2605.15488#bib.bib32),[32](https://arxiv.org/html/2605.15488#bib.bib125)\]. Consequently, practitioners must navigate a large set of estimators with distinct assumptions and limitations; model selection, training, and validation require substantial domain and methodological expertise.
This work aims to design a survival estimator that (i) avoids rigid simplifying assumptions; (ii) adapts to the effective complexity of the observed data; and (iii) enables efficient inference without extensive training or hyperparameter tuning.

Figure 2: Traditional survival analysis vs. SurvivalPFN. (Left): Traditional survival analysis requires an analyst to select and fit a suitable estimator for the observational data. (Right): SurvivalPFN pre-trains on diverse synthetic, identifiable DGPs. At inference, an observed dataset is provided as context, and the survival distributions for query instances are obtained with a single forward pass.

To do so, we build on prior-data fitted networks (PFNs) \[[66](https://arxiv.org/html/2605.15488#bib.bib87)\]: transformer-based models \[[100](https://arxiv.org/html/2605.15488#bib.bib106)\] that learn in-context approximations of posterior predictive distributions using synthetic tasks. Rather than fitting a new survival model for each dataset, SurvivalPFN shifts computation to an offline prior-data pretraining stage. At inference time, an observed right-censored dataset is provided as context, and a single forward pass returns posterior survival distributions for new individuals. This approach provides a practical route to Bayesian survival prediction that avoids dataset-specific optimization and extensive hyperparameter tuning.

We present SurvivalPFN, a transformer model for survival prediction via in-context learning. Our framework uses a general-purpose prior under conditional independent censoring to generate millions of simulated data generating processes (DGPs). By training on these diverse DGPs, SurvivalPFN learns to infer conditional survival distributions directly from observed right-censored data, yielding an easy-to-use and efficient estimator with strong empirical performance; see Figure [1](https://arxiv.org/html/2605.15488#S1.F1). Figure [2](https://arxiv.org/html/2605.15488#S1.F2) contrasts the SurvivalPFN workflow with traditional survival modeling. Our key contributions:
1. We introduce a framework for amortized Bayesian survival prediction via large-scale pretraining. SurvivalPFN uses a single forward pass to adapt to data complexity without task-specific training or hyperparameter tuning, while avoiding restrictive parametric assumptions.
2. We provide theoretical justification for SurvivalPFN as an asymptotically consistent estimator under identifiable right-censored data-generating processes.
3. We conduct a large-scale benchmark comparing 21 models across 61 datasets and 5 evaluation metrics, making this, to our knowledge, one of the largest survival model benchmarking studies to date. SurvivalPFN achieves the best median rank.
4. We release the code for training SurvivalPFN, together with a scikit-learn-style API (see Supplementary Material).
## 2 Background
Survival Analysis and Prediction. Let $X\in\mathbb{R}^{d}$ denote a covariate vector, and $E,C\in\mathbb{R}_{+}$ represent the event and censoring times. We assume a true joint distribution $P$ over the tuple $(X,E,C,T,\Delta)$, where $T=\min(E,C)$ and $\Delta=\mathbb{1}[E\leq C]$. Specifically, we observe $N$ draws $\mathcal{D}=\{(x_{i},t_{i},\delta_{i})\}_{i=1}^{N}$ from the distribution on observed variables $P_{\mathrm{obs}}(X,T,\Delta)$; $E$ and $C$ are latent variables; only $T$ and $\Delta$ are observed. Survival predictors aim to learn the conditional densities or survival functions:

$$f_{E\mid X}(t\mid x)\ =\ \Pr(E=t\mid X=x),\quad\text{or (equivalently)}\quad S_{E\mid X}(t\mid x)\ =\ \Pr(E>t\mid X=x).$$
Identifiable Survival Analysis. We say that the conditional survival function $S_{E\mid X}$ is (nonparametrically) *identified* if any two candidate data generating processes that induce the same observed law over $(X,T,\Delta)$ must also induce the same event-time survival function \[[99](https://arxiv.org/html/2605.15488#bib.bib89),[98](https://arxiv.org/html/2605.15488#bib.bib45)\]:

$$P^{(1)}_{\mathrm{obs}}(X,T,\Delta)\ =\ P^{(2)}_{\mathrm{obs}}(X,T,\Delta)\qquad\Longrightarrow\qquad S^{(1)}_{E\mid X}(t\mid x)\ =\ S^{(2)}_{E\mid X}(t\mid x),$$

for almost every $x$ and all $t$ in the identifiable support.
A sufficient condition for nonparametric identification is given by the following standard assumptions\.
###### Assumption 2.1 (Conditional independent censoring).
$E\perp C\mid X$.
###### Assumption 2.2 (Positivity).
For a time region $\mathcal{T}$ of interest, $\Pr(C\geq t\mid X=x)>0$, $\forall t\in\mathcal{T}$.
When $E\not\perp C\mid X$, the event distribution is generally not nonparametrically identifiable from $(X,T,\Delta)$ alone; identification then requires additional assumptions, e.g., a specified copula family for the dependence between $E$ and $C$ \[[24](https://arxiv.org/html/2605.15488#bib.bib1),[105](https://arxiv.org/html/2605.15488#bib.bib96)\]. Appendix [B](https://arxiv.org/html/2605.15488#A2) includes more theory on identifiability.
Bayesian Survival Prediction. Consider a family of identifiable survival data-generating processes indexed by $\theta\in\Theta$, with prior $\pi(\cdot)$ over $\Theta$. Each $\theta$ induces conditional densities and survival functions for both event and censoring times, $f_{E\mid X,\Theta}(e\mid x,\theta)$ and $S_{E\mid X,\Theta}(e\mid x,\theta)$. In Bayesian survival modeling, we place a prior density $f_{\Theta}(\theta)$ and infer the posterior density via Bayes' rule,

$$f_{\Theta\mid\mathscr{D}}(\theta\mid\mathcal{D})\ \propto\ f_{\mathscr{D}\mid\Theta}(\mathcal{D}\mid\theta)\,f_{\Theta}(\theta).\tag{2.1}$$

Under conditional independent censoring, the likelihood can be decomposed into:

$$f_{\mathscr{D}\mid\Theta}(\mathcal{D}\mid\theta)=\prod_{i=1}^{N}\left[f_{E\mid X,\Theta}(t_{i}\mid x_{i},\theta)\,S_{C\mid X,\Theta}(t_{i}\mid x_{i},\theta)\right]^{\delta_{i}}\left[f_{C\mid X,\Theta}(t_{i}\mid x_{i},\theta)\,S_{E\mid X,\Theta}(t_{i}\mid x_{i},\theta)\right]^{1-\delta_{i}}$$

where, for $A\in\{E,C\}$, $S_{A\mid X,\Theta}(t_{i}\mid x_{i},\theta)=\int_{t_{i}}^{\infty}f_{A\mid X,\Theta}(\tau\mid x_{i},\theta)\,d\tau$. Given a new covariate vector $x^{\ast}$, the Bayesian posterior predictive distribution (PPD) of the event time is

$$f_{E\mid X,\mathscr{D}}(t\mid x^{\ast},\mathcal{D})\ =\ \int_{\Theta}f_{E\mid X,\Theta}(t\mid x^{\ast},\vartheta)\,f_{\Theta\mid\mathscr{D}}(\vartheta\mid\mathcal{D})\,d\vartheta.\tag{2.2}$$

Analogously, the posterior predictive survival distribution (PPSD) is

$$S_{E\mid X,\mathscr{D}}(t\mid x^{\ast},\mathcal{D})\ =\ \int_{\Theta}S_{E\mid X,\Theta}(t\mid x^{\ast},\vartheta)\,f_{\Theta\mid\mathscr{D}}(\vartheta\mid\mathcal{D})\,d\vartheta.\tag{2.3}$$

This framework is attractive because it integrates over plausible survival mechanisms rather than committing to a single fitted model. However, it is difficult to use directly with flexible survival models: evaluating the likelihood can require numerical integration to obtain $S_{E\mid X,\Theta}$; the normalizing constant in Equation [2.1](https://arxiv.org/html/2605.15488#S2.E1) and the posterior predictive integrals in Equations [2.2](https://arxiv.org/html/2605.15488#S2.E2) and [2.3](https://arxiv.org/html/2605.15488#S2.E3) are generally intractable; and approximate inference methods such as Markov chain Monte Carlo (MCMC) \[[70](https://arxiv.org/html/2605.15488#bib.bib100),[1](https://arxiv.org/html/2605.15488#bib.bib101),[102](https://arxiv.org/html/2605.15488#bib.bib102)\] or variational inference (VI) \[[43](https://arxiv.org/html/2605.15488#bib.bib103),[101](https://arxiv.org/html/2605.15488#bib.bib104),[35](https://arxiv.org/html/2605.15488#bib.bib105),[79](https://arxiv.org/html/2605.15488#bib.bib133)\] must be rerun for each new dataset. We therefore seek an amortized procedure that preserves the posterior-predictive interpretation of Bayesian survival prediction while avoiding dataset-specific posterior computation.
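To make Equations 2.1 to 2.3 concrete, the following minimal sketch evaluates them by brute force on a grid. It assumes a deliberately simple, covariate-free model that is not taken from the paper: $E\sim\mathrm{Exp}(\lambda)$, $C\sim\mathrm{Exp}(\mu)$, independent, with a uniform prior over a grid of $(\lambda,\mu)$ pairs. The function names and the toy data are hypothetical.

```python
import math

def log_likelihood(data, lam, mu):
    """Right-censored log-likelihood, assuming E ~ Exp(lam), C ~ Exp(mu), E independent of C."""
    total = 0.0
    for t, delta in data:
        log_f_e, log_s_e = math.log(lam) - lam * t, -lam * t
        log_f_c, log_s_c = math.log(mu) - mu * t, -mu * t
        if delta == 1:
            total += log_f_e + log_s_c  # event observed: f_E(t) * S_C(t)
        else:
            total += log_f_c + log_s_e  # censored: f_C(t) * S_E(t)
    return total

def posterior_predictive_survival(data, t_query, grid):
    """Grid approximation of the PPSD under a uniform prior on grid x grid."""
    log_w = {(lam, mu): log_likelihood(data, lam, mu) for lam in grid for mu in grid}
    shift = max(log_w.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - shift) for k, v in log_w.items()}
    z = sum(weights.values())    # normalizing constant of the posterior
    # Average S_E(t | lam) = exp(-lam * t) over the posterior weights.
    return sum(w * math.exp(-lam * t_query) for (lam, _), w in weights.items()) / z

# Toy observations (t_i, delta_i); purely illustrative values.
data = [(0.5, 1), (1.2, 0), (0.3, 1), (2.0, 0), (0.9, 1)]
grid = [0.05 * k for k in range(1, 81)]  # candidate rates in (0, 4]
s_half = posterior_predictive_survival(data, 0.5, grid)
s_two = posterior_predictive_survival(data, 2.0, grid)
```

Even in this one-parameter-per-time toy case the posterior has to be recomputed for every new dataset, which is the cost that an amortized predictor is meant to remove.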
Prior-Data Fitted Networks and Amortized Bayesian Inference. Prior-data fitted networks (PFNs) amortize Bayesian posterior prediction by training transformers on synthetic tasks sampled from a prior-data generating process \[[66](https://arxiv.org/html/2605.15488#bib.bib87),[67](https://arxiv.org/html/2605.15488#bib.bib86)\]. Each task consists of a context set and query inputs, and the PFN is trained to predict query targets according to the posterior predictive distribution induced by the prior. After pretraining, posterior inference is no longer explicit: the transformer's in-context computation maps a new dataset and query points directly to predictive distributions in a single forward pass, replacing dataset-specific MCMC or VI. This connects PFNs to meta-learning \[[17](https://arxiv.org/html/2605.15488#bib.bib88)\], but replaces task-specific adaptation with in-context inference. PFNs have achieved strong transfer in tabular prediction \[[36](https://arxiv.org/html/2605.15488#bib.bib91),[37](https://arxiv.org/html/2605.15488#bib.bib92),[84](https://arxiv.org/html/2605.15488#bib.bib154)\], causal effect estimation \[[2](https://arxiv.org/html/2605.15488#bib.bib93),[88](https://arxiv.org/html/2605.15488#bib.bib164)\], and time-series prediction \[[38](https://arxiv.org/html/2605.15488#bib.bib163),[3](https://arxiv.org/html/2605.15488#bib.bib165)\], motivating our use of this paradigm for amortized Bayesian survival prediction.
## 3 Method
### 3.1 SurvivalPFN: Amortized Posterior Predictive Inference
Overview. SurvivalPFN learns an in-context approximation to Bayesian posterior predictive inference for right-censored survival data. Instead of specifying a tractable likelihood and performing posterior inference separately for each dataset, we specify a prior through a simulator over identifiable right-censored DGPs. A draw $\theta\sim\pi(\cdot)$ determines a joint law $P^{\theta}$ over $(X,E,C,T,\Delta)$. For each synthetic task, the simulator produces an observed right-censored context dataset $\mathcal{D}^{tr}_{\theta}=\{(x_{i},t_{i},\delta_{i})_{\theta}\}_{i=1}^{N}$, together with held-out query covariates $x^{\ast}_{\theta}$ and their latent event and censoring times $(e^{\ast}_{\theta},c^{\ast}_{\theta})$. The latent times are used only during prior-data training; at inference time, SurvivalPFN receives the same information available in ordinary survival prediction: an observed dataset $\mathcal{D}$ and query covariates $x^{\ast}$.
Architecture. Let $q_{\omega}$ denote a transformer with parameters $\omega$. Given a context dataset and a query covariate, SurvivalPFN outputs a predictive distribution over time. We additionally provide a binary *query indicator* $\widetilde{\delta}^{\ast}$, which is a control input specifying which PPD the model should return:

$$q_{\omega}\!\left(t\,\middle|\,x^{\ast}_{\theta},\widetilde{\delta}^{\ast},\mathcal{D}^{tr}_{\theta}\right)\ \approx\ \begin{cases}f_{E\mid X,\mathscr{D}}\!\left(t\mid x^{\ast}_{\theta},\mathcal{D}^{tr}_{\theta}\right),&\quad\widetilde{\delta}^{\ast}=1,\\[5.69054pt] f_{C\mid X,\mathscr{D}}\!\left(t\mid x^{\ast}_{\theta},\mathcal{D}^{tr}_{\theta}\right),&\quad\widetilde{\delta}^{\ast}=0.\end{cases}\tag{3.1}$$

Thus, $\widetilde{\delta}^{\ast}=1$ asks the model for the event-time PPD, whose tail probability gives the PPSD, while $\widetilde{\delta}^{\ast}=0$ asks for the posterior predictive censoring distribution (PPCD). This indicator is not an observed event label for the query point; rather, it specifies the prediction target.
Figure 3: Training SurvivalPFN. At each iteration, we sample an identifiable survival DGP and use it to generate context tokens $(X,T,\Delta)$ together with query covariates $X^{\ast}$. Query tokens are formed by pairing $X^{\ast}$ with query indicators $\widetilde{\Delta}^{\ast}$, and SurvivalPFN predicts the requested event- or censoring-time distribution. The model is trained by minimizing the negative log-likelihood loss.

SurvivalPFN parameterizes $q_{\omega}$ using the PFN-style transformer architecture of TabDPT \[[61](https://arxiv.org/html/2605.15488#bib.bib155)\] and CausalPFN \[[2](https://arxiv.org/html/2605.15488#bib.bib93)\]; see Appendix [D.2](https://arxiv.org/html/2605.15488#A4.SS2) for details. As shown in Figure [3](https://arxiv.org/html/2605.15488#S3.F3), each context row $(x_{i},t_{i},\delta_{i})_{\theta}\in\mathcal{D}^{tr}_{\theta}$ is embedded as a context token, while each query token is formed from $(x^{\ast}_{\theta},\widetilde{\delta}^{\ast})$. We use three query-indicator schedules during training:
- Event-only: always sets $\widetilde{\delta}^{\ast}=1$ and trains the model directly for PPSD prediction;
- Both: duplicates each query with $\widetilde{\delta}^{\ast}\in\{0,1\}$ and trains both event- and censoring-time prediction;
- Random: samples the query indicator according to the empirical censoring pattern in the context.
The transformer uses an asymmetric attention mask: context tokens attend to one another, while query tokens attend only to the context tokens and not to other queries, as shown by the two\-way and one\-way arrows in Figure[3](https://arxiv.org/html/2605.15488#S3.F3)\. Together with the absence of positional encodings, this makes predictions invariant to the ordering of the context dataset and conditionally independent across query points given the context\.
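The asymmetric mask described above can be sketched as a boolean matrix. This is an illustration under stated assumptions, not the released implementation: the function name and the token ordering (context rows first, then queries) are hypothetical, and the sketch follows the text literally by letting queries attend only to context tokens, whereas an actual implementation might also let a query attend to itself.

```python
def build_attention_mask(n_context, n_query):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed layout: tokens 0..n_context-1 are context rows, the rest are queries.
    Every token (context or query) may attend to context tokens only, so
    queries never see one another.
    """
    n_tokens = n_context + n_query
    return [[j < n_context for j in range(n_tokens)] for _ in range(n_tokens)]

mask = build_attention_mask(n_context=3, n_query=2)
```

Because no row of the mask depends on the order of the context columns, and no positional encoding is added, permuting the context rows leaves each query's prediction unchanged.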
The model represents each PPD as a discretized histogram over $L=1024$ time bins. For each query token, the transformer output is projected to logits over bins, followed by a softmax:

$$q_{\omega}\!\left(\cdot\,\middle|\,x^{\ast}_{\theta},\widetilde{\delta}^{\ast},\mathcal{D}^{tr}_{\theta}\right)\ =\ \left[q_{\omega,\ell}(x^{\ast}_{\theta},\widetilde{\delta}^{\ast},\mathcal{D}^{tr}_{\theta})\right]_{\ell=1}^{L}.$$

Before discretization, we apply a monotone transformation of time, such as `lognormal2normal` or `time2quantile`, to respect the nonnegative support of survival times and allocate resolution more evenly across the context time range; see Appendix [D.3](https://arxiv.org/html/2605.15488#A4.SS3). The predicted PPSD is obtained by summing the event-time tail probability mass:

$$\widehat{S}_{\omega}(\tau_{k}\mid x^{\ast},\mathcal{D})\ =\ \sum_{\ell=k+1}^{L}q_{\omega,\ell}(x^{\ast},\widetilde{\delta}^{\ast}=1,\mathcal{D}).\tag{3.2}$$
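The softmax-then-tail-sum step of Equation 3.2 can be sketched in a few lines. The helper names are hypothetical, indices are 0-based (so `surv[k]` holds the mass in bins strictly after bin `k`), and four bins stand in for the $L=1024$ bins used by the model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of bin logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def survival_from_bins(probs):
    """Reverse cumulative sum: surv[k] = sum of probs[l] for all l > k."""
    surv = [0.0] * len(probs)
    tail = 0.0
    for k in range(len(probs) - 1, -1, -1):
        surv[k] = tail      # mass strictly after bin k
        tail += probs[k]
    return surv

probs = softmax([0.0, 0.0, 0.0, 0.0])  # uniform histogram over 4 bins
surv = survival_from_bins(probs)
```

The resulting curve is non-increasing by construction and reaches zero at the last bin, since no probability mass lies beyond it.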
Training. Training follows the PFN principle \[[66](https://arxiv.org/html/2605.15488#bib.bib87),[36](https://arxiv.org/html/2605.15488#bib.bib91)\]. At each gradient update, we sample $\theta\sim\pi(\cdot)$, generate a context dataset and query points from the corresponding DGP, and use the simulator-provided latent times as supervision. Given the query indicator, the supervised target is

$$r^{\ast}_{\theta}(\widetilde{\delta}^{\ast})\ =\ \widetilde{\delta}^{\ast}e^{\ast}_{\theta}+\bigl(1-\widetilde{\delta}^{\ast}\bigr)c^{\ast}_{\theta}.$$

Let $g_{\mathcal{D}^{tr}_{\theta}}$ be the context-fitted monotone time transformation, and let $\kappa_{\mathcal{D}^{tr}_{\theta}}(r)\in\{1,\ldots,L\}$ denote the transformed-time bin containing $g_{\mathcal{D}^{tr}_{\theta}}(r)$. We train SurvivalPFN with the discrete negative log-likelihood,

$$\mathcal{L}_{\mathrm{NLL}}(\omega)\ =\ \mathbb{E}_{\theta\sim\pi(\cdot)}\,\mathbb{E}_{\mathcal{D}^{tr}_{\theta},\,x^{\ast}_{\theta},\,e^{\ast}_{\theta},\,c^{\ast}_{\theta},\,\widetilde{\delta}^{\ast}}\left[-\log q_{\omega,\kappa_{\mathcal{D}^{tr}_{\theta}}\!\left(r^{\ast}_{\theta}(\widetilde{\delta}^{\ast})\right)}\!\left(x^{\ast}_{\theta},\widetilde{\delta}^{\ast},\mathcal{D}^{tr}_{\theta}\right)\right].\tag{3.3}$$

This objective is a tractable prior-data likelihood for the requested latent query time. At the population optimum, the Bayes-optimal predictor is the conditional distribution of the latent target bin given the observed context and query. Thus, when the prior is restricted to identifiable right-censored DGPs, minimizing Equation [3.3](https://arxiv.org/html/2605.15488#S3.E3) trains SurvivalPFN to approximate the Bayesian PPD induced by $\pi(\cdot)$. We also consider a smoothed cross-entropy variant, which replaces the one-hot target with a narrow Gaussian-smoothed histogram over nearby time bins; see Appendix [D.4](https://arxiv.org/html/2605.15488#A4.SS4).
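For a single query, the term inside the expectation of Equation 3.3 can be sketched as follows. Everything here is illustrative: a plain logarithm stands in for the context-fitted transformation $g$ (the paper's actual transformations are described in its appendix), the bin edges and predicted histogram are made-up values, and bins are 0-based.

```python
import bisect
import math

def target_time(event_time, censor_time, query_indicator):
    """r = delta * e + (1 - delta) * c: pick the latent time the query asks for."""
    return query_indicator * event_time + (1 - query_indicator) * censor_time

def bin_index(r, edges, transform=math.log):
    """0-based index of the bin [edges[k], edges[k+1]) containing transform(r).

    Values outside the edge range are clipped to the first or last bin.
    """
    k = bisect.bisect_right(edges, transform(r)) - 1
    return min(max(k, 0), len(edges) - 2)

def discrete_nll(probs, k):
    """Negative log-probability that the model assigns to the target bin."""
    return -math.log(probs[k])

edges = [-2.0, -1.0, 0.0, 1.0, 2.0]  # 4 bins in transformed (log) time
probs = [0.1, 0.2, 0.6, 0.1]         # hypothetical predicted histogram
e_star, c_star = 2.5, 0.7            # latent times, known only to the simulator

event_bin = bin_index(target_time(e_star, c_star, 1), edges)
censor_bin = bin_index(target_time(e_star, c_star, 0), edges)
event_loss = discrete_nll(probs, event_bin)
```

In training, the same histogram would not be reused for both indicator values; the model produces a separate output for each query token, and the loss is averaged over sampled tasks and queries.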
Inference. At inference time, SurvivalPFN is applied directly to a real right-censored dataset $\mathcal{D}$ and query covariates $x^{\ast}$. A single forward pass with $\widetilde{\delta}^{\ast}=1$ returns the event-time predictive distribution, from which we compute $\widehat{S}_{\omega}(t\mid x^{\ast},\mathcal{D})$ via Equation [3.2](https://arxiv.org/html/2605.15488#S3.E2). No dataset-specific gradient updates, posterior sampling, variational optimization, or hyperparameter tuning are required. In this way, pretraining shifts the computational burden from test-time Bayesian inference to an offline prior-fitting stage, yielding a reusable amortized Bayesian survival predictor.
### 3.2 Consistency of the Bayesian Posterior Predictive Target
We next give an informal consistency statement for the Bayesian target learned by SurvivalPFN; a formal version and proof are deferred to Appendix[C](https://arxiv.org/html/2605.15488#A3)\. The key point is that SurvivalPFN is consistent for conditional event distributions that are identifiable from the observed right\-censored data\.
###### Proposition 3.1 (Informal consistency).
Let $\theta^{\ast}$ denote the parameter of the true survival data-generating process. Assume the prior over survival DGPs is supported only on identifiable right-censored mechanisms. Then, for every time $t$ in the support and for almost every query covariate $x^{\ast}$, the Bayesian PPSD converges almost surely to the true conditional survival function:

$$S_{E\mid X,\mathscr{D}}(t\mid x^{\ast},\mathcal{D})\ =\ \int_{\Theta}S_{E\mid X,\Theta}(t\mid x^{\ast},\theta)\,f_{\Theta\mid\mathscr{D}}(\theta\mid\mathcal{D})\,d\theta\quad\xrightarrow[N\to\infty]{\mathrm{a.s.}}\quad S_{E\mid X,\Theta}(t\mid x^{\ast},\theta^{\ast}).$$
###### Proof sketch\.
Group survival DGPs into observational equivalence classes:

$$\theta_{1}\ \sim\ \theta_{2}\quad\Longleftrightarrow\quad P^{\theta_{1}}_{\mathrm{obs}}(X,T,\Delta)\ =\ P^{\theta_{2}}_{\mathrm{obs}}(X,T,\Delta).$$

The observed data can distinguish different equivalence classes, but cannot distinguish DGPs within the same class. By applying Bayesian consistency to this quotient space, the posterior concentrates on the true observational equivalence class $[\theta^{\ast}]$ as $N\to\infty$. Therefore, the Bayesian PPSD converges to the posterior average of $S_{E\mid X,\Theta}(t\mid x^{\ast},\theta)$ over DGPs that are observationally equivalent to $\theta^{\ast}$, that is, over all $\theta$ satisfying $[\theta]=[\theta^{\ast}]$. If the prior is survival-identifiable, then every DGP in this equivalence class induces the same conditional event-time survival function. Hence this posterior average is equal to the true survival function $S_{E\mid X,\Theta}(t\mid x^{\ast},\theta^{\ast})$, which gives the desired consistency. ∎
Moreover, if SurvivalPFN exactly amortizes this posterior predictive distribution, then its predicted survival curve inherits the same consistency guarantee:
$$\widehat{S}_{\omega}(t\mid x^{\ast},\mathcal{D})\ \xrightarrow[N\to\infty]{\mathrm{a.s.}}\ S_{E\mid X,\Theta}(t\mid x^{\ast},\theta^{\ast}).$$
### 3.3 Prior over Identifiable Survival DGPs
Optimizing our loss function does not require an explicit parameterization of the distribution $P^{\theta}$. Instead, it requires a prior $\pi(\cdot)$ that can generate observational datasets, consisting of tuples $(x_{i},t_{i},\delta_{i})$, along with query points $x^{\ast}_{\theta}$ with their event/censoring times. Still, not every choice of prior has the desired properties. Proposition [3.1](https://arxiv.org/html/2605.15488#S3.Thmproposition1) highlights two key design principles for the prior: *(i)* it should rule out non-identifiable right-censored mechanisms; and *(ii)* subject to this identifiability constraint, the prior should be broad enough to cover a diverse family of plausible survival mechanisms, increasing the chance that the true DGP $\theta^{\ast}$ lies within, or close to, the prior support.
Our prior generation relies on synthetic random multilayer perceptrons (MLPs). Following Hollmann et al. \[[36](https://arxiv.org/html/2605.15488#bib.bib91)\], we sample a random MLP with varying numbers of layers, hidden dimensions, activations, and random initializations. We apply additive Gaussian noise after each layer in the MLP to induce randomness in the generated data. To sample a synthetic right-censored dataset from our prior, we first sample a random MLP, apply it to standard Gaussian noise, and read off covariates of varying dimensions from different neurons in the MLP. Randomizing the design of the MLPs makes the generated data more diverse. Next, we sample two additional random MLPs and apply them to the generated covariates to sample event and censoring times. We then shift the sampled times to ensure non-negativity. By design, the event and censoring times are conditionally independent given the covariates, satisfying the first condition; see Figure [4](https://arxiv.org/html/2605.15488#S3.F4) (left).
Figure 4: Summary over 500 generated datasets. (Left): Histogram of conditional mutual information. (Right): Diversity coverage over censoring rate and observed-time dispersion, colored by conditional observed-time entropy.

To cover even more diverse survival regimes (see data diversity in Figure [4](https://arxiv.org/html/2605.15488#S3.F4), right), the prior mixes four families of synthetic survival generators. The *naive prior* treats generated table outputs as raw event and censoring times. The *survival-distribution prior* samples smooth random monotone maps, such as Bernstein maps, and pushes uniform noise through them to obtain flexible distributions for event and censoring. The *mixture prior* samples event and censoring times from mixtures of Weibull or log-normal components with covariate-dependent mixture parameters. Finally, the *kitchen-sink prior* is a meta-prior that samples one of the above generators for each task, thereby increasing prior diversity.
For each synthetic dataset, we also sample one of four censoring mechanisms: *uniform censoring*, where $C$ is sampled from a uniform time range; *random censoring*, where $C$ is generated from an independent tabular mechanism; *administrative censoring*, where subjects have different entry times but share a common study end date; and *conditional independent censoring*, where both $E$ and $C$ depend on $X$ but are generated from independent conditional blocks. The observed survival data are then formed as in Section 2, with $T=\min(E,C)$ and $\Delta=\mathbb{1}[E\leq C]$. Appendix [D.1](https://arxiv.org/html/2605.15488#A4.SS1) provides full prior-generation details.
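The generation recipe in this subsection (random MLP for covariates, two further independent random MLPs for event and censoring times, a shift to non-negativity, then $T=\min(E,C)$ and $\Delta=\mathbb{1}[E\leq C]$) can be sketched with the standard library alone. This is a heavily simplified illustration, not the paper's prior: the fixed two-layer tanh networks, the noise scale, the shared shift, and all function names are assumptions, and covariates are read from output neurons only.

```python
import math
import random

def random_mlp(rng, in_dim, hidden_dim, out_dim, noise_std=0.1):
    """Sample a two-layer tanh MLP that adds Gaussian noise after each layer."""
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, 1.0) for _ in range(hidden_dim)] for _ in range(out_dim)]

    def forward(x):
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) + rng.gauss(0.0, noise_std)
                  for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) + rng.gauss(0.0, noise_std)
                for row in w2]

    return forward

def sample_right_censored_dataset(seed, n=50, d=3, latent_dim=4):
    """Return n rows (x, t, delta) from one randomly drawn synthetic DGP."""
    rng = random.Random(seed)
    covariate_net = random_mlp(rng, latent_dim, 8, d)
    event_net = random_mlp(rng, d, 8, 1)    # E depends on X only
    censor_net = random_mlp(rng, d, 8, 1)   # C depends on X only, separate weights and noise
    xs = [covariate_net([rng.gauss(0.0, 1.0) for _ in range(latent_dim)]) for _ in range(n)]
    raw_e = [event_net(x)[0] for x in xs]
    raw_c = [censor_net(x)[0] for x in xs]
    shift = min(raw_e + raw_c)              # common shift keeps all times non-negative
    rows = []
    for x, e, c in zip(xs, raw_e, raw_c):
        e, c = e - shift, c - shift
        rows.append((x, min(e, c), int(e <= c)))  # T = min(E, C), Delta = 1[E <= C]
    return rows

dataset = sample_right_censored_dataset(seed=0)
```

Because the event and censoring networks share no weights and draw separate noise, the two latent times are conditionally independent given the covariates, which is the property the prior is designed to guarantee.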
## 4 Experiments and Results
We conduct extensive experiments to investigate the following research questions \(RQs\):
- RQ1. How does SurvivalPFN compare with survival baselines in predictive performance?
- RQ2. How efficient is SurvivalPFN compared with other survival estimators?
- RQ3. How sensitive is SurvivalPFN to the proportion of training/context samples?
- RQ4. How does SurvivalPFN compare with general tabular foundation models (TFMs)?
- RQ5. How do priors, query schedules, transformations, and losses affect performance?
Figure 5: Dataset size (cutoffs at 500 and 5000), censoring rate and tail rate (cutoffs at 33% and 67%).

We evaluate SurvivalPFN on a large-scale benchmark covering diverse real-world regimes. The benchmark contains 81 datasets (see Figure [5](https://arxiv.org/html/2605.15488#S4.F5) for a preview): 20 are used only for SurvivalPFN checkpoint selection, and the remaining 61 are held out for final evaluation. Datasets are drawn from SurvSet \[[15](https://arxiv.org/html/2605.15488#bib.bib110)\] and additional textbook, software-package, and recent-publication sources. The dataset summary table and Appendix [F](https://arxiv.org/html/2605.15488#A6) provide full dataset descriptions and summaries.
Models. We evaluate 21 survival models from five families (Table [2](https://arxiv.org/html/2605.15488#A5.T2)); details are in Appendices [E.1](https://arxiv.org/html/2605.15488#A5.SS1)-[E.3](https://arxiv.org/html/2605.15488#A5.SS3).
- Tabular foundation models: SurvivalPFN and StaticSurvivalTFM \[[46](https://arxiv.org/html/2605.15488#bib.bib130)\].
- Classical survival models: CoxPH \[[11](https://arxiv.org/html/2605.15488#bib.bib38)\], CoxNet \[[94](https://arxiv.org/html/2605.15488#bib.bib121)\], and cSVR \[[77](https://arxiv.org/html/2605.15488#bib.bib120)\].
- Tree-based models: GB \[[86](https://arxiv.org/html/2605.15488#bib.bib124)\], CWGB \[[40](https://arxiv.org/html/2605.15488#bib.bib123)\], and RSF \[[41](https://arxiv.org/html/2605.15488#bib.bib49)\].
- Neural discrete-time models: DeepHit \[[55](https://arxiv.org/html/2605.15488#bib.bib30)\], DeepSurv \[[45](https://arxiv.org/html/2605.15488#bib.bib29)\], MTLR \[[103](https://arxiv.org/html/2605.15488#bib.bib82),[19](https://arxiv.org/html/2605.15488#bib.bib83)\], Nnet-survival \[[4](https://arxiv.org/html/2605.15488#bib.bib51),[22](https://arxiv.org/html/2605.15488#bib.bib52)\], CoxTime \[[53](https://arxiv.org/html/2605.15488#bib.bib39)\], IWSG \[[31](https://arxiv.org/html/2605.15488#bib.bib126)\], CQRNN \[[73](https://arxiv.org/html/2605.15488#bib.bib127)\], and BNN-MTLR \[[80](https://arxiv.org/html/2605.15488#bib.bib98)\].
- Neural continuous-time models: DSM \[[68](https://arxiv.org/html/2605.15488#bib.bib32)\], SuMoNet \[[87](https://arxiv.org/html/2605.15488#bib.bib55)\], SurvivalMDN \[[32](https://arxiv.org/html/2605.15488#bib.bib125)\], DeepAFT-Weibull \[[71](https://arxiv.org/html/2605.15488#bib.bib129)\], and DeepAFT-Loglogistic \[[71](https://arxiv.org/html/2605.15488#bib.bib129)\].
Metrics\. We evaluate models using five metrics \(Appendix [E\.4](https://arxiv.org/html/2605.15488#A5.SS4)\): IPCW\-adjusted integrated Brier score \(IBS; probabilistic accuracy\) \[[26](https://arxiv.org/html/2605.15488#bib.bib131)\], concordance index \(CI; discrimination\) \[[34](https://arxiv.org/html/2605.15488#bib.bib116)\], D\-calibration \(distributional calibration\) \[[29](https://arxiv.org/html/2605.15488#bib.bib134)\], median\-time mean absolute error \(MAE; time prediction\) \[[79](https://arxiv.org/html/2605.15488#bib.bib133)\], and log\-rank reliability \(agreement with observed time\-to\-event outcomes\)\.
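To make the discrimination metric concrete, the sketch below computes a plain Harrell\-style concordance index from risk scores\. It is an illustrative assumption about the general form of the metric: it ignores tied observed times and applies no IPCW weighting, and the exact definition used in the benchmark is the one given in Appendix E\.4\. The function name `harrell_c_index` is hypothetical\.

```python
import numpy as np

def harrell_c_index(time, event, risk):
    """Fraction of comparable pairs ordered correctly by the risk score.

    A pair (i, j) is comparable when subject i has an observed event and
    time[i] < time[j]; it is concordant when risk[i] > risk[j].
    Ties in risk count as half-concordant. Returns nan if no pair is comparable.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable > 0 else float("nan")

# Higher risk should correspond to an earlier event.
print(harrell_c_index(time=[1, 2, 3, 4], event=[1, 1, 0, 1], risk=[4, 3, 2, 1]))
```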
Experimental Protocol\. For each dataset, we conduct 10 repeated experiments using independent 70%/30% train/test splits\. We report the mean $\pm$ standard deviation across the 10 repetitions\. For aggregate comparison, models are ranked separately within each dataset and metric, with rank 1 assigned to the best\-performing method\. Appendices [E\.5](https://arxiv.org/html/2605.15488#A5.SS5)\-[E\.6](https://arxiv.org/html/2605.15488#A5.SS6) provide further details\.
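The per\-dataset ranking step can be sketched as follows\. This is a minimal illustration under stated assumptions: scores are already averaged over the 10 repetitions, ties receive the average rank \(the tie rule is not stated in this section\), and the function name `rank_methods` and the numbers are hypothetical\.

```python
import numpy as np

def rank_methods(scores, higher_is_better):
    """Rank methods within each dataset; rank 1 is the best method.

    scores: array of shape (n_datasets, n_methods) for a single metric.
    Ties are assigned the average of the ranks they span (assumed convention).
    """
    scores = np.asarray(scores, dtype=float)
    signed = -scores if higher_is_better else scores  # smaller is always better
    ranks = np.empty_like(signed)
    for d, row in enumerate(signed):
        strictly_better = (row[None, :] < row[:, None]).sum(axis=1)
        tied = (row[None, :] == row[:, None]).sum(axis=1)  # includes the method itself
        ranks[d] = strictly_better + (tied + 1) / 2.0
    return ranks

# Toy IBS values (lower is better) for three methods on two datasets.
ibs = [[0.10, 0.20, 0.15],
       [0.30, 0.30, 0.25]]
print(rank_methods(ibs, higher_is_better=False))
```

Ranking within each dataset and metric puts metrics with different scales, such as IBS and MAE, on a common footing before aggregation\.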
RQ1: Predictive Performance\. Figure [6](https://arxiv.org/html/2605.15488#S4.F6) summarizes model ranks across all benchmark datasets, with better\-performing methods appearing farther to the right\. SurvivalPFN achieves the strongest overall rank among the compared methods, indicating robust performance across the full benchmark suite\. Metric\-wise, SurvivalPFN is among the leading methods for IBS, MAE, and Log\-Rank, showing strong probabilistic survival prediction, accurate time estimation, and strong agreement with observed time\-to\-event outcomes\. It also remains competitive on CI and D\-calibration, although several other baselines \(e\.g\., RSF, CWGB, DeepSurv\) rank higher in these metric\-specific rankings\.
Additional stratified results \(based on sample size and censoring rate\) are provided in Appendix [G\.1](https://arxiv.org/html/2605.15488#A7.SS1)\. There, SurvivalPFN shows its largest advantage on small datasets \(Figure [11](https://arxiv.org/html/2605.15488#A7.F11)\), while its relative performance decreases as dataset size increases \(Figures [12](https://arxiv.org/html/2605.15488#A7.F12) and [13](https://arxiv.org/html/2605.15488#A7.F13)\)\. This behavior is consistent with the motivation of amortized Bayesian survival prediction: when each downstream dataset contains limited observations, SurvivalPFN can leverage the inductive structure learned during prior\-data pretraining rather than fitting a flexible survival model from scratch\. In contrast, across low\-, medium\-, and high\-censoring regimes \(Figures [14](https://arxiv.org/html/2605.15488#A7.F14)\-[16](https://arxiv.org/html/2605.15488#A7.F16)\), SurvivalPFN remains consistently strong\.
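Figure 6 reports median ranks with 95% bootstrap confidence intervals\. One way such intervals could be computed is a percentile bootstrap over datasets, sketched below\. The bootstrap variant, the number of resamples, and the function name `bootstrap_median_ci` are assumptions for illustration; the paper does not specify them in this section\.

```python
import numpy as np

def bootstrap_median_ci(ranks, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median rank of one method.

    ranks: one rank per dataset for a single method and metric.
    Returns (median, lower, upper).
    """
    rng = np.random.default_rng(seed)
    ranks = np.asarray(ranks, dtype=float)
    # Resample datasets with replacement, one row of indices per bootstrap replicate.
    idx = rng.integers(0, len(ranks), size=(n_boot, len(ranks)))
    boot_medians = np.median(ranks[idx], axis=1)
    lower, upper = np.quantile(boot_medians, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(np.median(ranks)), float(lower), float(upper)

# Toy ranks of one method on 61 datasets (random values, for illustration only).
toy_ranks = np.random.default_rng(1).integers(1, 22, size=61)
median, lower, upper = bootstrap_median_ci(toy_ranks)
print(median, lower, upper)
```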
Figure 6: Model ranks across 61 benchmark datasets\. Points/stars denote median ranks across datasets, with horizontal bars showing 95% bootstrap confidence intervals for the median rank\.
RQ2: Computational Efficiency\. Figure [1](https://arxiv.org/html/2605.15488#S1.F1) compares the computational efficiency of SurvivalPFN with all baselines\. We report the total training\-plus\-inference time across datasets, excluding hyperparameter\-tuning time for neural\-network\-based methods\. SurvivalPFN is highly efficient: it is only modestly slower than CoxNet, the fastest baseline, while achieving the best overall ranking across the five evaluation metrics\. By contrast, tree\-based and neural\-network\-based methods require substantially greater computation\.
Figure 7: Performance on the PBC dataset for SurvivalPFN and top\-performing models\. Shaded regions denote standard errors over 10 repeated runs\.
RQ3: Sensitivity to Training\-Set Size\. Figure [7](https://arxiv.org/html/2605.15488#S4.F7) shows how model performance changes as the fraction of training data varies\. SurvivalPFN shows stable predictive performance across split ratios, maintaining consistently low IBS and high CI even when only a small fraction of the data is used for training\. In contrast, several baselines are more sensitive to limited training data: CoxNet, GB, and CWGB perform poorly at the smallest ratio and improve markedly as more data become available\. These trends suggest that SurvivalPFN is comparatively robust in low\-data regimes\. Appendix [G\.3](https://arxiv.org/html/2605.15488#A7.SS3) and Figure [17](https://arxiv.org/html/2605.15488#A7.F17) contain results for more datasets\.
RQ4: Comparison with General TFMs\. Figure [8](https://arxiv.org/html/2605.15488#S4.F8) compares SurvivalPFN with general\-purpose tabular foundation regressors, which are trained only on uncensored instances\. SurvivalPFN achieves the best overall rank and is consistently the top\-ranked method across all metrics\. This suggests that directly adapting generic TFMs to survival outcomes is insufficient: explicitly pretraining on identifiable right\-censored survival tasks yields substantially more reliable survival prediction\.
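A small simulation illustrates one reason why fitting a regressor on uncensored instances alone can mislead: subjects with long event times are more likely to be censored, so the uncensored subset is biased toward short times\. The setup below is a toy assumption \(independent exponential event and censoring times\) and is not the experiment reported in Figure 8\.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
event_time = rng.exponential(scale=10.0, size=n)   # latent event times, true mean 10
censor_time = rng.exponential(scale=10.0, size=n)  # independent censoring times
observed_time = np.minimum(event_time, censor_time)
event_indicator = event_time <= censor_time

true_mean = event_time.mean()
# Mean event time among uncensored subjects only, which is what a regressor
# trained after dropping censored rows would effectively target.
uncensored_mean = observed_time[event_indicator].mean()

print(f"true mean event time:       {true_mean:.2f}")
print(f"mean over uncensored only:  {uncensored_mean:.2f}")
```

In this toy setting the uncensored\-only mean is roughly half the true mean, since for independent exponentials the minimum has rate equal to the sum of the two rates\.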
RQ5: Ablation Studies\. Due to space constraints, Appendix [G\.5](https://arxiv.org/html/2605.15488#A7.SS5) reports how prior design, query schedules, monotone transformations, and objective functions affect SurvivalPFN performance\.
Figure 8: Comparison of SurvivalPFN with selected general TFMs across 61 benchmark datasets\. Plotting conventions follow Figure [6](https://arxiv.org/html/2605.15488#S4.F6)\.
## 5 Related Work
Classical and Deep Survival Analysis\. A broad range of estimators has been developed for right\-censored data\. CoxPH \[[11](https://arxiv.org/html/2605.15488#bib.bib38)\] relies on proportional hazards and linear covariate effects, while fully parametric models such as exponential \[[62](https://arxiv.org/html/2605.15488#bib.bib78)\] and Weibull \[[75](https://arxiv.org/html/2605.15488#bib.bib79)\] models impose explicit distributional forms\. Modern machine\-learning methods improve flexibility, but introduce other assumptions\. Neural Cox\-based models, such as DeepSurv \[[45](https://arxiv.org/html/2605.15488#bib.bib29)\] and CoxTime \[[53](https://arxiv.org/html/2605.15488#bib.bib39)\], relax linear covariate effects but retain Cox\-style hazard modeling\. Discrete\-time models, including MTLR \[[103](https://arxiv.org/html/2605.15488#bib.bib82), [19](https://arxiv.org/html/2605.15488#bib.bib83)\] and DeepHit \[[55](https://arxiv.org/html/2605.15488#bib.bib30), [54](https://arxiv.org/html/2605.15488#bib.bib31)\], depend on a chosen time grid and may become overparameterized with many bins\. Continuous\-time neural models are more flexible, but still impose structure through parametric mixtures \[[68](https://arxiv.org/html/2605.15488#bib.bib32)\], latent\-variable models \[[85](https://arxiv.org/html/2605.15488#bib.bib65), [64](https://arxiv.org/html/2605.15488#bib.bib81)\], monotonic density estimators \[[87](https://arxiv.org/html/2605.15488#bib.bib55), [8](https://arxiv.org/html/2605.15488#bib.bib71)\], or neural ODE hazards \[[28](https://arxiv.org/html/2605.15488#bib.bib67), [97](https://arxiv.org/html/2605.15488#bib.bib66)\]\.
Bayesian Survival Analysis\. Bayesian survival models quantify uncertainty by placing priors over model parameters and integrating over posterior uncertainty\. Recent neural variants include BNN\-ISD, which uses Bayesian neural networks to obtain credible intervals and supports feature selection \[[80](https://arxiv.org/html/2605.15488#bib.bib98)\]; Bayesian LSTM\-SURV, which combines the survival likelihood with Bayesian mixed\-effects updating under a Weibull parametric form \[[21](https://arxiv.org/html/2605.15488#bib.bib161)\]; and NeuralSurv, which uses variational inference for Bayesian deep survival prediction \[[65](https://arxiv.org/html/2605.15488#bib.bib160)\]\. While these methods demonstrate the value of Bayesian uncertainty quantification, they remain tied to method\-specific assumptions and per\-dataset inference or updating\.
Concurrent Work on TFMs for Survival Analysis\. Two concurrent works also explore TFMs for survival analysis\. Kim et al\. \[[46](https://arxiv.org/html/2605.15488#bib.bib130)\] convert survival prediction to a sequence of binary classification tasks over discretized time points, enabling off\-the\-shelf TFMs without survival\-specific pretraining\. This simple reduction avoids new pretraining, but expands each instance into many time\-indexed examples; as a result, context length scales with the number of bins, larger datasets require subsampling, and performance can suffer when the TFM sees only a compressed view of the risk set\. Seletkov et al\. \[[91](https://arxiv.org/html/2605.15488#bib.bib157)\] instead pretrain a survival\-specific in\-context model on synthetic data from parametric extended\-hazard mechanisms\. This is closer to SurvivalPFN, but its prior is limited to parametric hazard families and random censoring \($E \perp C$\), potentially limiting broader use\. In contrast, SurvivalPFN uses a broader family of identifiable right\-censored DGPs, supports covariate\-dependent censoring under conditional independence, avoids explicit parametric structure, and provides a posterior\-predictive consistency argument\. We include the static formulation of Kim et al\. \[[46](https://arxiv.org/html/2605.15488#bib.bib130)\] as StaticSurvivalTFM; SIC, the model of Seletkov et al\. \[[91](https://arxiv.org/html/2605.15488#bib.bib157)\], is not directly compared because public weights/code are not yet available\.
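To see why a reduction to binary classification inflates the context, consider the following sketch of a generic time\-indexed expansion\. It is one common way to realize such a reduction and is not necessarily the exact construction of Kim et al\.; the function name `expand_to_binary`, the labeling rule, and the time grid are assumptions for illustration\.

```python
import numpy as np

def expand_to_binary(X, time, event, grid):
    """Expand each subject into one binary example per grid time t_k.

    Label 0 if the subject is still event-free at t_k (observed time > t_k),
    label 1 if the event was observed by t_k; rows whose status at t_k is
    unknown (censored at or before t_k) are skipped.
    The grid time is appended to the covariates as an extra feature.
    """
    rows, labels = [], []
    for x, t, d in zip(np.asarray(X, dtype=float), time, event):
        for t_k in grid:
            if t > t_k:
                label = 0
            elif d == 1:
                label = 1
            else:
                continue  # censored before t_k: outcome at t_k is not observed
            rows.append(np.append(x, t_k))
            labels.append(label)
    return np.array(rows), np.array(labels)

X = [[0.5], [1.5]]
rows, labels = expand_to_binary(X, time=[2.0, 3.0], event=[1, 0], grid=[1.0, 2.0, 3.0, 4.0])
print(rows.shape, labels)
```

With $n$ subjects and $K$ grid times the expanded table has up to $nK$ rows, which is the source of the context\-length growth noted above\.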
## 6 Conclusions, Limitations, and Future Work
We introduced SurvivalPFN, a prior\-data fitted network for amortized Bayesian survival prediction from right\-censored data\. By pretraining on diverse, identifiable survival data\-generating processes, SurvivalPFN produces posterior predictive survival distributions in a single forward pass, without dataset\-specific training or hyperparameter tuning\. Across 61 real\-world datasets, SurvivalPFN achieves strong overall performance while remaining computationally efficient, suggesting that PFN\-style in\-context learning is a promising foundation for flexible survival prediction\. Because survival models already inform high\-stakes decisions, from clinical risk scores \[[57](https://arxiv.org/html/2605.15488#bib.bib119)\] to resource\-allocation policies \[[47](https://arxiv.org/html/2605.15488#bib.bib76)\], we believe that advances in accuracy, efficiency, and uncertainty quantification for survival models can generate substantial human benefit\.
Several limitations remain\. First, SurvivalPFN relies on conditional independent censoring to preserve identifiability; under dependent censoring, the event\-time distribution is not nonparametrically identifiable from observed data alone, and our method is not valid\. Extending SurvivalPFN to identifiable dependent\-censoring priors, such as copula methods \[[24](https://arxiv.org/html/2605.15488#bib.bib1), [105](https://arxiv.org/html/2605.15488#bib.bib96)\], is an important direction\. Second, SurvivalPFN inherits the size\-scalability trade\-off of PFN\-style models, with reduced relative performance on larger tables \[[37](https://arxiv.org/html/2605.15488#bib.bib92), [2](https://arxiv.org/html/2605.15488#bib.bib93)\]; improving long\-context inference is left for future work\.
## References
- \[1\] C\. Andrieu, N\. de Freitas, A\. Doucet, and M\. I\. Jordan \(2003\) An introduction to mcmc for machine learning\. Machine learning 50\(1\), pp\. 5–43\.
- \[2\] V\. Balazadeh, H\. Kamkari, V\. Thomas, B\. Li, J\. Ma, J\. C\. Cresswell, and R\. G\. Krishnan \(2025\) CausalPFN: amortized causal effect estimation via in\-context learning\. arXiv preprint arXiv:2506\.07918\.
- \[3\] D\. Berghaus, P\. Seifner, K\. Cvejoski, C\. Ojeda, and R\. J\. Sánchez \(2025\) In\-context learning of temporal point processes with foundation inference models\. arXiv preprint arXiv:2509\.24762\.
- \[4\] E\. Biganzoli, P\. Boracchi, L\. Mariani, and E\. Marubini \(1998\) Feed forward neural networks for the analysis of censored survival data: a partial logistic regression approach\. Statistics in medicine 17\(10\), pp\. 1169–1186\.
- \[5\] N\. E\. Breslow \(1975\) Analysis of survival data under the proportional hazards model\. International Statistical Review/Revue Internationale de Statistique, pp\. 45–57\.
- \[6\] D\. Broadway, M\. Iester, M\. Schulzer, and G\. Douglas \(2001\) Survival analysis for success of molteno tube implants\. British Journal of Ophthalmology 85\(6\), pp\. 689–695\.
- \[7\] D\. Chicco and G\. Jurman \(2020\) Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone\. BMC medical informatics and decision making 20\(1\), pp\. 16\.
- \[8\] P\. Chilinski and R\. Silva \(2020\) Neural likelihoods via cumulative distribution functions\. In Conference on Uncertainty in Artificial Intelligence, pp\. 420–429\.
- \[9\] M\. J\. Cooper, X\. Gao, X\. Zhao, D\. Khoroshchuk, Y\. Wang, A\. Azhie, M\. Naghibzadeh, S\. Holdsworth, J\. A\. Gross, M\. Brudno, et al\. \(2024\) DynaMELD: a dynamic model of end\-stage liver disease for equitable prioritization\. medRxiv, pp\. 2024–11\.
- \[10\] M\. Cooper, Z\. Ji, and R\. G\. Krishnan \(2023\) Machine learning in computational histopathology: challenges and opportunities\. Genes, Chromosomes and Cancer 62\(9\), pp\. 540–556\.
- \[11\] D\. R\. Cox \(1972\) Regression models and life\-tables\. Journal of the Royal Statistical Society: Series B \(Methodological\) 34\(2\), pp\. 187–202\.
- \[12\] C\. Curtis, S\. P\. Shah, S\. Chin, G\. Turashvili, O\. M\. Rueda, M\. J\. Dunning, D\. Speed, A\. G\. Lynch, S\. Samarajiwa, Y\. Yuan, et al\. \(2012\) The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups\. Nature 486\(7403\), pp\. 346–352\.
- \[13\] A\. Defazio, X\. Yang, H\. Mehta, K\. Mishchenko, A\. Khaled, and A\. Cutkosky \(2024\) The road less scheduled\. Advances in Neural Information Processing Systems 37, pp\. 9974–10007\.
- \[14\] J\. L\. Doob \(1949\) Application of the theory of martingales\. Le Calcul des Probabilites et ses Applications, pp\. 23–27\. External Links: [Link](https://cir.nii.ac.jp/crid/1573387449499005824)
- \[15\] E\. Drysdale \(2022\) SurvSet: an open\-source time\-to\-event dataset repository\. arXiv preprint arXiv:2203\.03094\.
- \[16\] J\. P\. Equihua, H\. Nordmark, M\. Ali, and B\. Lausen \(2023\) Modelling customer churn for the retail industry in a deep learning based sequential framework\. arXiv preprint arXiv:2304\.00575\.
- \[17\] C\. Finn, P\. Abbeel, and S\. Levine \(2017\) Model\-agnostic meta\-learning for fast adaptation of deep networks\. In International conference on machine learning, pp\. 1126–1135\.
- \[18\] S\. Fotso et al\. \(2019–\) PySurvival: open source package for survival analysis modeling\. External Links: [Link](https://www.pysurvival.io/)
- \[19\] S\. Fotso \(2018\) Deep neural networks for survival analysis based on a multi\-task framework\. arXiv preprint arXiv:1801\.05512\.
- \[20\] X\. Gao, M\. Cooper, M\. Naghibzadeh, A\. Azhie, M\. Bhat, and R\. G\. Krishnan \(2024\) Predicting long\-term allograft survival in liver transplant recipients\. arXiv preprint arXiv:2408\.05437\.
- \[21\] Y\. Gao, S\. Li, D\. Wang, J\. Mao, and L\. Ouyang \(2025\) A neural network\-based survival analysis model considering censored data for failure prediction\. IEEE Transactions on Automation Science and Engineering 22, pp\. 24585–24598\.
- \[22\] M\. F\. Gensheimer and B\. Narasimhan \(2019\) A scalable discrete\-time survival model for neural networks\. PeerJ 7, pp\. e6257\.
- \[23\] A\. Gepp and K\. Kumar \(2008\) The role of survival analysis in financial distress prediction\. International research journal of finance and economics 16\(16\), pp\. 13–34\.
- \[24\] A\. H\. F\. Gharari, M\. Cooper, R\. Greiner, and R\. G\. Krishnan \(2023\) Copula\-based deep survival models for dependent censoring\. In Uncertainty in Artificial Intelligence, pp\. 669–680\.
- \[25\] P\. Giot and A\. Schwienbacher \(2007\) IPOs, trade sales and liquidations: modelling venture capital exits using survival analysis\. Journal of Banking & Finance 31\(3\), pp\. 679–702\.
- \[26\] E\. Graf, C\. Schmoor, W\. Sauerbrei, and M\. Schumacher \(1999\) Assessment and comparison of prognostic classification schemes for survival data\. Statistics in medicine 18\(17\-18\), pp\. 2529–2545\.
- \[27\] L\. Grinsztajn, K\. Flöge, O\. Key, F\. Birkel, P\. Jund, B\. Roof, B\. Jäger, D\. Safaric, S\. Alessi, A\. Hayler, et al\. \(2025\) Tabpfn\-2\.5: advancing the state of the art in tabular foundation models\. arXiv preprint arXiv:2511\.08667\.
- \[28\] S\. Groha, S\. M\. Schmon, and A\. Gusev \(2020\) A general framework for survival analysis and multi\-state modelling\. arXiv preprint arXiv:2006\.04893\.
- \[29\] H\. Haider, B\. Hoehn, S\. Davis, and R\. Greiner \(2020\) Effective ways to build and evaluate individual survival distributions\. Journal of Machine Learning Research 21\(85\), pp\. 1–63\.
- \[30\] S\. M\. Hammer, D\. A\. Katzenstein, M\. D\. Hughes, H\. Gundacker, R\. T\. Schooley, R\. H\. Haubrich, W\. K\. Henry, M\. M\. Lederman, J\. P\. Phair, M\. Niu, et al\. \(1996\) A trial comparing nucleoside monotherapy with combination therapy in hiv\-infected adults with cd4 cell counts from 200 to 500 per cubic millimeter\. New England Journal of Medicine 335\(15\), pp\. 1081–1090\.
- \[31\] X\. Han, M\. Goldstein, A\. Puli, T\. Wies, A\. Perotte, and R\. Ranganath \(2021\) Inverse\-weighted survival games\. Advances in neural information processing systems 34, pp\. 2160–2172\.
- \[32\] X\. Han, M\. Goldstein, and R\. Ranganath \(2022\) Survival mixture density networks\. In Machine Learning for Healthcare Conference, pp\. 224–248\.
- \[33\] F\. E\. Harrell, R\. M\. Califf, D\. B\. Pryor, K\. L\. Lee, and R\. A\. Rosati \(1982\) Evaluating the yield of medical tests\. Jama 247\(18\), pp\. 2543–2546\.
- \[34\] F\. E\. Harrell Jr, K\. L\. Lee, and D\. B\. Mark \(1996\) Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors\. Statistics in medicine 15\(4\), pp\. 361–387\.
- \[35\] M\. D\. Hoffman, D\. M\. Blei, C\. Wang, and J\. Paisley \(2013\) Stochastic variational inference\. the Journal of machine Learning research 14\(1\), pp\. 1303–1347\.
- \[36\] N\. Hollmann, S\. Müller, K\. Eggensperger, and F\. Hutter \(2022\) Tabpfn: a transformer that solves small tabular classification problems in a second\. arXiv preprint arXiv:2207\.01848\.
- \[37\] N\. Hollmann, S\. Müller, L\. Purucker, A\. Krishnakumar, M\. Körfer, S\. B\. Hoo, R\. T\. Schirrmeister, and F\. Hutter \(2025\) Accurate predictions on small data with a tabular foundation model\. Nature 637\(8045\), pp\. 319–326\.
- \[38\] S\. B\. Hoo, S\. Müller, D\. Salinas, and F\. Hutter \(2025\) From tables to time: extending tabpfn\-v2 to time series forecasting\. arXiv preprint arXiv:2501\.02945\.
- \[39\] D\. W\. Hosmer Jr, S\. Lemeshow, and S\. May \(2008\) Applied survival analysis: regression modeling of time\-to\-event data\. John Wiley & Sons\.
- \[40\] T\. Hothorn, P\. Bühlmann, S\. Dudoit, A\. Molinaro, and M\. J\. Van Der Laan \(2006\) Survival ensembles\. Biostatistics 7\(3\), pp\. 355–373\.
- \[41\] H\. Ishwaran, U\. B\. Kogalur, E\. H\. Blackstone, M\. S\. Lauer, et al\. \(2008\) Random survival forests\. Annals of Applied Statistics 2\(3\), pp\. 841–860\.
- \[42\] A\. E\. Johnson, L\. Bulgarelli, L\. Shen, A\. Gayles, A\. Shammout, S\. Horng, T\. J\. Pollard, S\. Hao, B\. Moody, B\. Gow, et al\. \(2023\) MIMIC\-iv, a freely accessible electronic health record dataset\. Scientific data 10\(1\), pp\. 1\.
- \[43\] M\. I\. Jordan, Z\. Ghahramani, T\. S\. Jaakkola, and L\. K\. Saul \(1999\) An introduction to variational methods for graphical models\. Machine learning 37\(2\), pp\. 183–233\.
- \[44\] O\. Kardaun \(1983\) Statistical survival analysis of male larynx\-cancer patients\-a case study\. Statistica neerlandica 37\(3\), pp\. 103–125\.
- \[45\] J\. L\. Katzman, U\. Shaham, A\. Cloninger, J\. Bates, T\. Jiang, and Y\. Kluger \(2018\) DeepSurv: personalized treatment recommender system using a cox proportional hazards deep neural network\. BMC medical research methodology 18, pp\. 1–12\.
- \[46\] D\. I\. Kim, W\. S\. Lai, and K\. W\. Zhang \(2026\) Tabular foundation models can do survival analysis\. arXiv preprint arXiv:2601\.22259\.
- \[47\] W\. R\. Kim, A\. Mannalithara, J\. K\. Heimbach, P\. S\. Kamath, S\. K\. Asrani, S\. W\. Biggins, N\. L\. Wood, S\. E\. Gentry, and A\. J\. Kwong \(2021\) MELD 3\.0: the model for end\-stage liver disease updated for the modern era\. Gastroenterology 161\(6\), pp\. 1887–1895\.
- \[48\] D\. P\. Kingma and J\. Ba \(2014\) Adam: a method for stochastic optimization\. arXiv preprint arXiv:1412\.6980\.
- \[49\] J\. P\. Klein and M\. L\. Moeschberger \(2003\) Survival analysis: techniques for censored and truncated data\. Vol\. 1230, Springer\.
- \[50\] D\. G\. Kleinbaum and M\. Klein \(1996\) Survival analysis a self\-learning text\. Springer\.
- \[51\] D\. Kuhajda \(2016\) Using survival analysis to evaluate medical equipment battery life\. Biomedical instrumentation & technology 50\(3\), pp\. 184–189\.
- \[52\] N\. Kumar, S\. Qi, L\. Kuan, W\. Sun, J\. Zhang, and R\. Greiner \(2022\) Learning accurate personalized survival models for predicting hospital discharge and mortality of covid\-19 patients\. Scientific reports 12\(1\), pp\. 4472\.
- \[53\] H\. Kvamme, Ø\. Borgan, and I\. Scheel \(2019\) Time\-to\-event prediction with neural networks and cox regression\. Journal of machine learning research 20\(129\), pp\. 1–30\.
- \[54\] C\. Lee, J\. Yoon, and M\. Van Der Schaar \(2019\) Dynamic\-deephit: a deep learning approach for dynamic survival analysis with competing risks based on longitudinal data\. IEEE Transactions on Biomedical Engineering 67\(1\), pp\. 122–133\.
- \[55\] C\. Lee, W\. Zame, J\. Yoon, and M\. Van Der Schaar \(2018\) Deephit: a deep learning approach to survival analysis with competing risks\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 32\.
- \[56\] C\. Li, V\. Patil, K\. M\. Rasmussen, C\. Yong, H\. Chien, D\. Morreall, J\. Humpherys, B\. C\. Sauer, Z\. Burningham, and A\. S\. Halwani \(2021\) Predicting survival in veterans with follicular lymphoma using structured electronic health record information and machine learning\. International Journal of Environmental Research and Public Health 18\(5\), pp\. 2679\.
- \[57\] G\. Y\. Lip, R\. Nieuwlaat, R\. Pisters, D\. A\. Lane, and H\. J\. Crijns \(2010\) Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor\-based approach: the euro heart survey on atrial fibrillation\. Chest 137\(2\), pp\. 263–272\.
- \[58\] C\. L\. Loprinzi, J\. A\. Laurie, H\. S\. Wieand, J\. E\. Krook, P\. J\. Novotny, J\. W\. Kugler, J\. Bartel, M\. Law, M\. Bateman, and N\. E\. Klatt \(1994\) Prospective evaluation of prognostic variables from patient\-completed questionnaires\. north central cancer treatment group\. Journal of Clinical Oncology 12\(3\), pp\. 601–607\.
- \[59\] J\. Lu \(2002\) Predicting customer churn in the telecommunications industry \- an application of survival analysis modeling using sas\. SAS User Group International \(SUGI27\) Online Proceedings 114, pp\. 27\.
- \[60\] M\. Luoma and E\. K\. Laitinen \(1991\) Survival analysis as a tool for company failure prediction\. Omega 19\(6\), pp\. 673–678\.
- \[61\] J\. Ma, V\. Thomas, R\. Hosseinzadeh, A\. Labach, H\. Kamkari, J\. C\. Cresswell, K\. Golestan, G\. Yu, A\. L\. Caterini, and M\. Volkovs \(2024\) Tabdpt: scaling tabular foundation models on real data\. arXiv preprint arXiv:2410\.18164\.
- \[62\] W\. Mendenhall and R\. Hader \(1958\) Estimation of parameters of mixed exponentially distributed failure time distributions from censored life test data\. Biometrika 45\(3\-4\), pp\. 504–520\.
- \[63\] M\. Merrell and L\. E\. Shulman \(1955\) Determination of prognosis in chronic disease, illustrated by systemic lupus erythematosus\. Journal of chronic diseases 1\(1\), pp\. 12–32\.
- \[64\] X\. Miscouridou, A\. Perotte, N\. Elhadad, and R\. Ranganath \(2018\) Deep survival analysis: nonparametrics and missingness\. In Machine Learning for Healthcare Conference, pp\. 244–256\.
- \[65\] M\. Monod, A\. Micheli, and S\. Bhatt \(2025\) NeuralSurv: deep survival analysis with bayesian uncertainty quantification\. arXiv preprint arXiv:2505\.11054\.
- \[66\] S\. Müller, N\. Hollmann, S\. P\. Arango, J\. Grabocka, and F\. Hutter \(2021\) Transformers can do bayesian inference\. arXiv preprint arXiv:2112\.10510\.
- \[67\] S\. Müller, A\. Reuter, N\. Hollmann, D\. Rügamer, and F\. Hutter \(2025\) Position: the future of bayesian prediction is prior\-fitted\. arXiv preprint arXiv:2505\.23947\.
- \[68\] C\. Nagpal, X\. Li, and A\. Dubrawski \(2021\) Deep survival machines: fully parametric survival regression and representation learning for censored data with competing risks\. IEEE Journal of Biomedical and Health Informatics 25\(8\), pp\. 3163–3175\.
- \[69\] C\. Nagpal, W\. Potosnak, and A\. Dubrawski \(2022\) Auton\-survival: an open\-source package for regression, counterfactual estimation, evaluation and phenotyping with censored time\-to\-event data\. In Machine Learning for Healthcare Conference, pp\. 585–608\.
- \[70\] R\. M\. Neal \(2012\) Bayesian learning for neural networks\. Vol\. 118, Springer Science & Business Media\.
- \[71\] P\. A\. Norman, W\. Li, W\. Jiang, and B\. E\. Chen \(2024\) DeepAFT: a nonlinear accelerated failure time model with artificial neural network\. Statistics in Medicine 43\(19\), pp\. 3689–3701\.
- \[72\]D\. Papathanasiou, K\. Demertzis, and N\. Tziritas\(2023\)Machine failure prediction using survival analysis\.Future Internet15\(5\),pp\. 153\.Cited by:[§1](https://arxiv.org/html/2605.15488#S1.p1.1)\.
- \[73\]T\. Pearce, J\. Jeong, J\. Zhu,et al\.\(2022\)Censored quantile regression neural networks for distribution\-free survival analysis\.Advances in neural information processing systems35,pp\. 7450–7461\.Cited by:[item 15](https://arxiv.org/html/2605.15488#A5.I1.i15.p1.1),[§1](https://arxiv.org/html/2605.15488#S1.p2.1),[4th item](https://arxiv.org/html/2605.15488#S4.I2.i4.p1.1)\.
- \[74\]Á\. Periáñez, A\. Saas, A\. Guitart, and C\. Magne\(2016\)Churn prediction in mobile social games: towards a complete assessment using survival ensembles\.In2016 IEEE international conference on data science and advanced analytics \(DSAA\),pp\. 564–573\.Cited by:[§1](https://arxiv.org/html/2605.15488#S1.p1.1)\.
- \[75\]R\. Peto and P\. Lee\(1973\)Weibull distributions for continuous\-carcinogenesis experiments\.Biometrics,pp\. 457–470\.Cited by:[§5](https://arxiv.org/html/2605.15488#S5.p1.1)\.
- \[76\]B\. Pineiro\-Lamas, A\. Lopez\-Cheda, R\. Cao, L\. Ramos\-Alonso, G\. Gonzalez\-Barbeito, C\. Barbeito\-Caamano, and A\. Bouzas\-Mosquera\(2023\)A cardiotoxicity dataset for breast cancer patients\.Scientific Data10\(1\),pp\. 527\.Cited by:[8th item](https://arxiv.org/html/2605.15488#A6.I1.i8.p1.1)\.
- \[77\]S\. Pölsterl, N\. Navab, and A\. Katouzian\(2015\)Fast training of support vector machines for survival analysis\.InJoint European conference on machine learning and knowledge discovery in databases,pp\. 243–259\.Cited by:[item 5](https://arxiv.org/html/2605.15488#A5.I1.i5.p1.1),[2nd item](https://arxiv.org/html/2605.15488#S4.I2.i2.p1.1)\.
- \[78\]S\. Pölsterl\(2020\)Scikit\-survival: a library for time\-to\-event analysis built on top of scikit\-learn\.Journal of Machine Learning Research21\(212\),pp\. 1–6\.Cited by:[item 3](https://arxiv.org/html/2605.15488#A5.I1.i3.p1.1),[item 4](https://arxiv.org/html/2605.15488#A5.I1.i4.p1.1),[item 5](https://arxiv.org/html/2605.15488#A5.I1.i5.p1.1),[item 6](https://arxiv.org/html/2605.15488#A5.I1.i6.p1.1),[item 7](https://arxiv.org/html/2605.15488#A5.I1.i7.p1.1),[item 8](https://arxiv.org/html/2605.15488#A5.I1.i8.p1.1)\.
- \[79\]S\. Qi, N\. Kumar, M\. Farrokh, W\. Sun, L\. Kuan, R\. Ranganath, R\. Henao, and R\. Greiner\(2023\)An effective meaningful way to evaluate survival models\.Proceedings of machine learning research202,pp\. 28244\.Cited by:[§E\.4](https://arxiv.org/html/2605.15488#A5.SS4.SSS0.Px3.p1.3),[21st item](https://arxiv.org/html/2605.15488#A6.I1.i21.p1.1),[§2](https://arxiv.org/html/2605.15488#S2.p5.11),[§4](https://arxiv.org/html/2605.15488#S4.p4.1)\.
- \[80\]S\. Qi, N\. Kumar, R\. Verma, J\. Xu, G\. Shen\-Tu, and R\. Greiner\(2023\)Using bayesian neural networks to select features and compute credible intervals for personalized survival prediction\.IEEE Transactions on Biomedical Engineering70\(12\),pp\. 3389–3400\.Cited by:[item 16](https://arxiv.org/html/2605.15488#A5.I1.i16.p1.1),[4th item](https://arxiv.org/html/2605.15488#S4.I2.i4.p1.1),[§5](https://arxiv.org/html/2605.15488#S5.p2.1)\.
- \[81\]S\. Qi, N\. Kumar, J\. Xu, J\. Patel, S\. Damaraju, G\. Shen\-Tu, and R\. Greiner\(2022\)Personalized breast cancer onset prediction from lifestyle and health history information\.Plos one17\(12\),pp\. e0279174\.Cited by:[§1](https://arxiv.org/html/2605.15488#S1.p1.1)\.
- \[82\]S\. Qi, W\. Sun, and R\. Greiner\(2023\)SurvivalEVAL: a comprehensive open\-source python package for evaluating individual survival distributions\.InProceedings of the AAAI Symposium Series,Vol\.2,pp\. 453–457\.Cited by:[§E\.4](https://arxiv.org/html/2605.15488#A5.SS4.SSS0.Px5.p2.1)\.
- \[83\]S\. Qi, Y\. Yu, and R\. Greiner\(2024\)Conformalized survival distributions: a generic post\-process to increase calibration\.InInternational Conference on Machine Learning,pp\. 41303–41339\.Cited by:[23rd item](https://arxiv.org/html/2605.15488#A6.I1.i23.p1.1),[24th item](https://arxiv.org/html/2605.15488#A6.I1.i24.p1.1)\.
- \[84\]J\. Qu, D\. Holzmüller, G\. Varoquaux, and M\. L\. Morvan\(2026\)TabICLv2: a better, faster, scalable, and open tabular foundation model\.arXiv preprint arXiv:2602\.11139\.Cited by:[§G\.4](https://arxiv.org/html/2605.15488#A7.SS4.p1.1),[Appendix H](https://arxiv.org/html/2605.15488#A8.SS0.SSS0.Px1.p1.4),[§2](https://arxiv.org/html/2605.15488#S2.p6.1)\.
- \[85\]R\. Ranganath, A\. Perotte, N\. Elhadad, and D\. Blei\(2016\)Deep survival analysis\.InMachine Learning for Healthcare Conference,pp\. 101–114\.Cited by:[§1](https://arxiv.org/html/2605.15488#S1.p2.1),[§5](https://arxiv.org/html/2605.15488#S5.p1.1)\.
- \[86\]G\. Ridgeway\(1999\)The state of boosting\.Computing science and statistics,pp\. 172–181\.Cited by:[item 6](https://arxiv.org/html/2605.15488#A5.I1.i6.p1.1),[3rd item](https://arxiv.org/html/2605.15488#S4.I2.i3.p1.1)\.
- \[87\]D\. Rindt, R\. Hu, D\. Steinsaltz, and D\. Sejdinovic\(2022\)Survival regression with proper scoring rules and monotonic neural networks\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 1190–1205\.Cited by:[item 18](https://arxiv.org/html/2605.15488#A5.I1.i18.p1.1),[5th item](https://arxiv.org/html/2605.15488#S4.I2.i5.p1.1),[§5](https://arxiv.org/html/2605.15488#S5.p1.1)\.
- \[88\]J\. Robertson, A\. Reuter, S\. Guo, N\. Hollmann, F\. Hutter, and B\. Schölkopf\(2025\)Do\-pfn: in\-context learning for causal effect estimation\.arXiv preprint arXiv:2506\.06039\.Cited by:[§2](https://arxiv.org/html/2605.15488#S2.p6.1)\.
- \[89\]J\. M\. Robins and D\. M\. Finkelstein\(2000\)Correcting for noncompliance and dependent censoring in an aids clinical trial with inverse probability of censoring weighted \(ipcw\) log\-rank tests\.Biometrics56\(3\),pp\. 779–788\.Cited by:[§E\.4](https://arxiv.org/html/2605.15488#A5.SS4.SSS0.Px2.p1.4)\.
- \[90\]P\. H\. Rossi, R\. A\. Berk, and K\. J\. Lenihan\(1980\)Money, work and crime: some experimental results\.New York: Academic Press\.Cited by:[7th item](https://arxiv.org/html/2605.15488#A6.I1.i7.p1.1)\.
- \[91\]D\. Seletkov, P\. Hager, R\. Braren, D\. Rueckert, and R\. Rehms\(2026\)Survival in\-context: prior\-fitted in\-context learning tabular foundation model for survival analysis\.arXiv preprint arXiv:2603\.29475\.Cited by:[Appendix H](https://arxiv.org/html/2605.15488#A8.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2605.15488#S5.p3.1)\.
- \[92\]Y\. She, Z\. Jin, J\. Wu, J\. Deng, L\. Zhang, H\. Su, G\. Jiang, H\. Liu, D\. Xie, N\. Cao,et al\.\(2020\)Development and validation of a deep learning model for non–small cell lung cancer survival\.JAMA network open3\(6\),pp\. e205842–e205842\.Cited by:[§1](https://arxiv.org/html/2605.15488#S1.p1.1)\.
- \[93\]M\. Sikora, Ł\. Wróbel, and A\. Gudyś\(2019\)GuideR: a guided separate\-and\-conquer rule learning in classification, regression, and survival settings\.Knowledge\-Based Systems173,pp\. 1–14\.Cited by:[4th item](https://arxiv.org/html/2605.15488#A6.I1.i4.p1.1)\.
- \[94\]N\. Simon, J\. H\. Friedman, T\. Hastie, and R\. Tibshirani\(2011\)Regularization paths for cox’s proportional hazards model via coordinate descent\.Journal of statistical software39,pp\. 1–13\.Cited by:[item 4](https://arxiv.org/html/2605.15488#A5.I1.i4.p1.1),[2nd item](https://arxiv.org/html/2605.15488#S4.I2.i2.p1.1)\.
- \[95\]The Cancer Genome Atlas Research Network\(2008\)Comprehensive genomic characterization defines human glioblastoma genes and core pathways\.Nature455\(7216\),pp\. 1061–1068\.Cited by:[9th item](https://arxiv.org/html/2605.15488#A6.I1.i9.p1.1)\.
- \[96\]L\. Tang, C\. Li, J\. Li, W\. Chen, Q\. Chen, L\. Yuan, X\. Lai, Y\. He, Y\. Xu, D\. Hu,et al\.\(2016\)Establishment and validation of prognostic nomograms for endemic nasopharyngeal carcinoma\.Journal of the National Cancer Institute108\(1\),pp\. djv291\.Cited by:[19th item](https://arxiv.org/html/2605.15488#A6.I1.i19.p1.1)\.
- \[97\]W\. Tang, J\. Ma, Q\. Mei, and J\. Zhu\(2022\)Soden: a scalable continuous\-time survival model through ordinary differential equation networks\.Journal of Machine Learning Research23\(34\),pp\. 1–29\.Cited by:[§5](https://arxiv.org/html/2605.15488#S5.p1.1)\.
- \[98\]A\. A\. Tsiatis\(2006\)Semiparametric theory and missing data\.Vol\.4,Springer\.Cited by:[§2](https://arxiv.org/html/2605.15488#S2.p2.2)\.
- \[99\]A\. Tsiatis\(1975\)A nonidentifiability aspect of the problem of competing risks\.\.Proceedings of the National Academy of Sciences72\(1\),pp\. 20–22\.Cited by:[Appendix B](https://arxiv.org/html/2605.15488#A2.1.p1.1),[Appendix B](https://arxiv.org/html/2605.15488#A2.p2.5),[§2](https://arxiv.org/html/2605.15488#S2.p2.2)\.
- \[100\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin\(2017\)Attention is all you need\.Advances in neural information processing systems30\.Cited by:[§1](https://arxiv.org/html/2605.15488#S1.p4.1)\.
- \[101\]M\. J\. Wainwright, M\. I\. Jordan,et al\.\(2008\)Graphical models, exponential families, and variational inference\.Foundations and Trends® in Machine Learning1\(1–2\),pp\. 1–305\.Cited by:[§2](https://arxiv.org/html/2605.15488#S2.p5.11)\.
- \[102\]M\. Welling and Y\. W\. Teh\(2011\)Bayesian learning via stochastic gradient langevin dynamics\.InProceedings of the 28th international conference on machine learning \(ICML\-11\),pp\. 681–688\.Cited by:[§2](https://arxiv.org/html/2605.15488#S2.p5.11)\.
- \[103\]C\. Yu, R\. Greiner, H\. Lin, and V\. Baracos\(2011\)Learning patient\-specific cancer survival distributions as a sequence of dependent regressors\.Advances in neural information processing systems24\.Cited by:[item 11](https://arxiv.org/html/2605.15488#A5.I1.i11.p1.1),[18th item](https://arxiv.org/html/2605.15488#A6.I1.i18.p1.1),[§1](https://arxiv.org/html/2605.15488#S1.p2.1),[4th item](https://arxiv.org/html/2605.15488#S4.I2.i4.p1.1),[§5](https://arxiv.org/html/2605.15488#S5.p1.1)\.
- \[104\]A\. Zehir, R\. Benayed, R\. H\. Shah, A\. Syed, S\. Middha, H\. R\. Kim, P\. Srinivasan, J\. Gao, D\. Chakravarty, S\. M\. Devlin,et al\.\(2017\)Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients\.Nature medicine23\(6\),pp\. 703–713\.Cited by:[20th item](https://arxiv.org/html/2605.15488#A6.I1.i20.p1.1)\.
- \[105\]W\. Zhang, C\. K\. Ling, and X\. Zhang\(2024\)Deep copula\-based survival analysis for dependent censoring with identifiability guarantees\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 20613–20621\.Cited by:[§2](https://arxiv.org/html/2605.15488#S2.p4.4),[§6](https://arxiv.org/html/2605.15488#S6.p2.1)\.
- \[106\]X\. Zhang, D\. C\. Maddix, J\. Yin, N\. Erickson, A\. F\. Ansari, B\. Han, S\. Zhang, L\. Akoglu, C\. Faloutsos, M\. W\. Mahoney,et al\.\(2025\)Mitra: mixed synthetic priors for enhancing tabular foundation models\.arXiv preprint arXiv:2510\.21204\.Cited by:[item 2](https://arxiv.org/html/2605.15488#A5.I1.i2.p1.1),[§G\.4](https://arxiv.org/html/2605.15488#A7.SS4.p1.1),[Appendix H](https://arxiv.org/html/2605.15488#A8.SS0.SSS0.Px1.p1.4)\.
###### Appendix Contents
1. [1 Introduction](https://arxiv.org/html/2605.15488#S1)
2. [2 Background](https://arxiv.org/html/2605.15488#S2)
3. [3 Method](https://arxiv.org/html/2605.15488#S3)
    1. [3.1 SurvivalPFN: Amortized Posterior Predictive Inference](https://arxiv.org/html/2605.15488#S3.SS1)
    2. [3.2 Consistency of the Bayesian Posterior Predictive Target](https://arxiv.org/html/2605.15488#S3.SS2)
    3. [3.3 Prior over Identifiable Survival DGPs](https://arxiv.org/html/2605.15488#S3.SS3)
4. [4 Experiments and Results](https://arxiv.org/html/2605.15488#S4)
5. [5 Related Work](https://arxiv.org/html/2605.15488#S5)
6. [6 Conclusions, Limitations, and Future Work](https://arxiv.org/html/2605.15488#S6)
7. [References](https://arxiv.org/html/2605.15488#bib)
8. [A Notation](https://arxiv.org/html/2605.15488#A1)
9. [B Identifiability and Non-identifiability under Right Censoring](https://arxiv.org/html/2605.15488#A2)
10. [C Posterior-Predictive Consistency](https://arxiv.org/html/2605.15488#A3)
    1. [C.1 Notation and regularity assumptions](https://arxiv.org/html/2605.15488#A3.SS1)
    2. [C.2 Observed-law equivalence and survival identifiability](https://arxiv.org/html/2605.15488#A3.SS2)
    3. [C.3 Consistency of the Bayesian PPSD](https://arxiv.org/html/2605.15488#A3.SS3)
    4. [C.4 Implication for SurvivalPFN](https://arxiv.org/html/2605.15488#A3.SS4)
11. [D Additional SurvivalPFN Details](https://arxiv.org/html/2605.15488#A4)
    1. [D.1 Synthetic Prior and DGP Simulation](https://arxiv.org/html/2605.15488#A4.SS1)
    2. [D.2 Architecture Details](https://arxiv.org/html/2605.15488#A4.SS2)
    3. [D.3 Monotone Time Transformations](https://arxiv.org/html/2605.15488#A4.SS3)
    4. [D.4 Training Objective](https://arxiv.org/html/2605.15488#A4.SS4)
    5. [D.5 Inference Procedure](https://arxiv.org/html/2605.15488#A4.SS5)
12. [E Benchmarking Details](https://arxiv.org/html/2605.15488#A5)
    1. [E.1 Baseline Model Details](https://arxiv.org/html/2605.15488#A5.SS1)
    2. [E.2 Unified Time Grid for Consistent Evaluation](https://arxiv.org/html/2605.15488#A5.SS2)
    3. [E.3 Hyperparameter Tuning](https://arxiv.org/html/2605.15488#A5.SS3)
    4. [E.4 Evaluation Metrics](https://arxiv.org/html/2605.15488#A5.SS4)
    5. [E.5 Compute Environment and Runtime Protocol](https://arxiv.org/html/2605.15488#A5.SS5)
    6. [E.6 Performance Reporting Protocol](https://arxiv.org/html/2605.15488#A5.SS6)
13. [F Benchmark Dataset Details](https://arxiv.org/html/2605.15488#A6)
14. [G Additional Experimental Results](https://arxiv.org/html/2605.15488#A7)
    1. [G.1 RQ1: Predictive Performance](https://arxiv.org/html/2605.15488#A7.SS1)
    2. [G.2 RQ2: Computational Efficiency](https://arxiv.org/html/2605.15488#A7.SS2)
    3. [G.3 RQ3: Sensitivity to Training-Set Size](https://arxiv.org/html/2605.15488#A7.SS3)
    4. [G.4 RQ4: Compare with General Tabular Foundational Models](https://arxiv.org/html/2605.15488#A7.SS4)
    5. [G.5 RQ5: Ablation Studies](https://arxiv.org/html/2605.15488#A7.SS5)
15. [H Concurrent Works on Tabular Foundation Models for Survival Analysis](https://arxiv.org/html/2605.15488#A8)
## Appendix A Notation
Table 1 summarizes the notation used throughout the paper. Following convention, we use uppercase letters for random variables and lowercase letters for realizations.
Table 1: Summary of notation in the paper.

| Notation | Definition |
| --- | --- |
| $\mathcal{X}$, $\mathbb{R}_{+}$ | Covariate space and nonnegative time domain. |
| $i\in\{1,\ldots,N\}$ | Index for an individual observation. |
| $d$ | Number of covariates/features. |
| $X\in\mathcal{X}$ | Covariates. |
| $E\in\mathbb{R}_{+}$ | Latent event time of interest. |
| $C\in\mathbb{R}_{+}$ | Latent censoring time. |
| $T=\min(E,C)$ | Observed follow-up time. |
| $\Delta=\mathbbm{1}\{E\leq C\}$ | Event indicator: $\Delta=1$ for observed event and $\Delta=0$ for right-censored. |
| $(x_{i},t_{i},\delta_{i})$ | Observed right-censored tuple. |
| $(e_{i},c_{i})$ | Latent event and censoring times. |
| $\mathscr{D}$ | Random observed dataset. |
| $\mathcal{D}=\{(x_{i},t_{i},\delta_{i})\}_{i=1}^{N}$ | Realization of $\mathscr{D}$. |
| $\mathcal{D}^{tr}_{\theta}$, $\mathcal{D}^{te}_{\theta}$ | Synthetic context/training set and query/test set sampled from DGP parameter $\theta$ during prior-data pretraining. |
| $\omega$ | Trainable parameters of SurvivalPFN. |
| $\theta\in\Theta$ | Latent parameter indexing a survival data-generating process. |
| $\theta^{\ast}$ | True DGP parameter for a downstream dataset. |
| $\pi(\cdot)$ | Prior distribution over survival DGPs used to generate synthetic training tasks. |
| $P^{\theta}$ | Joint law of $(X,E,C,T,\Delta)$ under DGP parameter $\theta$. |
| $P^{\theta}_{\mathrm{obs}}$ | Observational law of $(X,T,\Delta)$ induced by $P^{\theta}$. |
| $f^{\theta}_{E\mid X}(t\mid x)$ | Conditional density function for event time, under $\theta$. |
| $F^{\theta}_{E\mid X}(t\mid x)$ | Conditional CDF for event time, $P^{\theta}(E\leq t\mid X=x)$. |
| $S^{\theta}_{E\mid X}(t\mid x)$ | Conditional survival function for event time, $P^{\theta}(E>t\mid X=x)=1-F^{\theta}_{E\mid X}(t\mid x)$. |
| $\lambda^{\theta}_{E\mid X}(t\mid x)$ | Conditional hazard function, $\lambda^{\theta}_{E\mid X}(t\mid x)=f^{\theta}_{E\mid X}(t\mid x)/S^{\theta}_{E\mid X}(t\mid x)$. |
| $S^{\theta}_{C\mid X}(t\mid x)$ | Conditional survival function for censoring time, $P^{\theta}(C\geq t\mid X=x)$. |
| $\lambda^{\theta}_{C\mid X}(t\mid x)$ | Conditional censoring hazard. |
| $f_{E\mid X,\mathscr{D}}(t\mid x,\mathcal{D})$ | Posterior predictive event-time distribution. |
| $S_{E\mid X,\mathscr{D}}(t\mid x,\mathcal{D})$ | Posterior predictive survival distribution (PPSD). |
| $x^{\ast}$ | Query covariates for a new individual. |
| $\widetilde{\delta}^{\ast}$ | Query indicator supplied to SurvivalPFN. |
| $q_{\omega}(\cdot\mid x^{\ast},\widetilde{\delta}^{\ast},\mathcal{D})$ | SurvivalPFN’s predictive distribution over discretized time bins. |
| $L$ | Number of time bins used to represent the predictive distribution. |
| $\{\mathcal{I}_{\ell}\}_{\ell=1}^{L}$ | SurvivalPFN’s transformed-time bins used to represent the discretized predictive distribution. |
| $\mathcal{G}=\{t_{1}<\cdots<t_{m}\}$ | Discrete time grid for benchmarking discrete-time survival models. |
| $\mathrm{CV}(T)$ | Coefficient of variation of observed times: $\mathrm{CV}(T)=\operatorname{sd}(T)/\mathbb{E}[T]$, when used in DGP diagnostics. |
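To make the relationship between the latent pair $(E,C)$ and the observed pair $(T,\Delta)$ concrete, the following is a minimal sketch of a right-censored simulator. The exponential distributions, the rate values, and the function name `simulate_right_censored` are illustrative assumptions and are not the synthetic prior used by SurvivalPFN.

```python
import numpy as np

def simulate_right_censored(n, event_rate=1.0, censor_rate=0.5, seed=0):
    """Draw latent (E, C) independently and return them with the observed (T, Delta).

    The exponential event/censoring laws are an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    e = rng.exponential(scale=1.0 / event_rate, size=n)   # latent event times E
    c = rng.exponential(scale=1.0 / censor_rate, size=n)  # latent censoring times C
    t = np.minimum(e, c)                                  # observed time T = min(E, C)
    delta = (e <= c).astype(int)                          # event indicator 1{E <= C}
    return e, c, t, delta

e, c, t, delta = simulate_right_censored(n=5)
for ti, di in zip(t, delta):
    print(f"t={ti:.3f}, delta={di}")
```

Only `t` and `delta` (together with covariates, omitted here) would be visible to a survival model; `e` and `c` are returned solely so that the construction can be inspected.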
## Appendix B Identifiability and Non-identifiability under Right Censoring
In the context of survival analysis and causal inference, identifiability is the prerequisite for learning\. If a model is non\-identifiable, infinite data cannot distinguish between multiple underlying “truths” \(e\.g\., whether a drug works or if patients are simply dropping out due to side effects\)\. Without identifiability, no consistent estimator exists, and any conclusions drawn from the data rely entirely on untestable assumptions rather than empirical evidence\.
Tsiatis \[[99](https://arxiv.org/html/2605.15488#bib.bib89), Theorem 2\] first proved that the latent joint distribution of event time $E$ and censoring time $C$ is not identifiable from the (infinite) observed data $(T,\Delta)$ without additional assumptions. Formally, we restate the theorem in the notation of this paper, treating $E$ and $C$ as two competing events:
###### Theorem B.1 (Non-identifiability; Marginal).
Let $S_{E,C}(e,c)=P(E>e,C>c)$ be an arbitrary joint survival function where $E$ and $C$ are dependent. There exists a different joint survival function $S_{E,C}^{\ast}(e,c)$, constructed such that $E$ and $C$ are independent, that is, $S_{E,C}^{\ast}(e,c)=S_{E}^{\ast}(e)\,S_{C}^{\ast}(c)$, which generates the exact same observed data distribution $P(T,\Delta)$.
This means that, without the assumption of random censoring ($E\perp C$), the marginal survival function $S_{E}(t)$ cannot be uniquely determined. This theorem can be easily extended to the conditional setting:
###### Theorem B.2 (Non-identifiability; Conditional).
Let $X$ be a set of covariates. Let the true conditional joint survival function be $S_{E,C\mid X}(e,c\mid x)=P(E>e,C>c\mid X=x)$, where $E\not\perp C\mid X$. For any such dependent model, there exists a valid conditionally independent model $S_{E,C\mid X}^{\ast}(e,c\mid x)=S_{E\mid X}^{\ast}(e\mid x)\,S_{C\mid X}^{\ast}(c\mid x)$ such that the observable distributions of $(T,\Delta\mid X)$ are identical.
###### Proof\.
The proof of conditional non-identifiability follows the steps of Tsiatis \[[99](https://arxiv.org/html/2605.15488#bib.bib89)\] directly; we present it for completeness.
Let the observed data be characterized by the conditional sub-survival functions:

$$
\begin{aligned}
\Phi(t\mid x) &= P(T>t,\Delta=1\mid x),\\
\Psi(t\mid x) &= P(T>t,\Delta=0\mid x).
\end{aligned}
$$

These functions completely describe the likelihood of the observed data.
We construct a “proxy” independent world, denoted by a superscript $\mathrm{ind}$, by defining its conditional hazards to match the observed cause-specific hazards of the original world:

$$
\begin{aligned}
\lambda_{E\mid X}^{\mathrm{ind}}(t\mid x) &= \lim_{dt\rightarrow 0}\frac{\Pr^{\mathrm{ind}}(t\leq T<t+dt,\Delta=1\mid T\geq t,X=x)}{dt} = \frac{-\frac{\partial}{\partial t}\Phi(t\mid x)}{\Phi(t\mid x)+\Psi(t\mid x)},\\
\lambda_{C\mid X}^{\mathrm{ind}}(t\mid x) &= \lim_{dt\rightarrow 0}\frac{\Pr^{\mathrm{ind}}(t\leq T<t+dt,\Delta=0\mid T\geq t,X=x)}{dt} = \frac{-\frac{\partial}{\partial t}\Psi(t\mid x)}{\Phi(t\mid x)+\Psi(t\mid x)}.
\end{aligned}
$$

We define the marginal survival functions in the proxy world as:

$$
\begin{aligned}
S_{E\mid X}^{\mathrm{ind}}(t\mid x) &= \exp\left(-\int_{0}^{t}\lambda_{E\mid X}^{\mathrm{ind}}(u\mid x)\,du\right),\\
S_{C\mid X}^{\mathrm{ind}}(t\mid x) &= \exp\left(-\int_{0}^{t}\lambda_{C\mid X}^{\mathrm{ind}}(u\mid x)\,du\right).
\end{aligned}
$$
And the joint distribution (with conditional independence):

$$
S_{E,C\mid X}^{\mathrm{ind}}(e,c\mid x) = S_{E\mid X}^{\mathrm{ind}}(e\mid x)\,S_{C\mid X}^{\mathrm{ind}}(c\mid x).
$$

To verify that this proxy world generates the same observed sub-survival function $\Phi(t\mid x)$, we calculate the probability of observing an event in the proxy world:

$$
\begin{aligned}
\Phi^{\mathrm{ind}}(t\mid x) &= \int_{t}^{\infty}f_{E\mid X}^{\mathrm{ind}}(u\mid x)\,S_{C\mid X}^{\mathrm{ind}}(u\mid x)\,du\\
&= \int_{t}^{\infty}\lambda_{E\mid X}^{\mathrm{ind}}(u\mid x)\,S_{E\mid X}^{\mathrm{ind}}(u\mid x)\,S_{C\mid X}^{\mathrm{ind}}(u\mid x)\,du\\
&= \int_{t}^{\infty}\lambda_{E\mid X}^{\mathrm{ind}}(u\mid x)\,S_{E,C\mid X}^{\mathrm{ind}}(u,u\mid x)\,du.
\end{aligned}
$$

We now substitute the definition of $\lambda_{E\mid X}^{\mathrm{ind}}$ and note that $S_{E,C\mid X}^{\mathrm{ind}}(u,u\mid x)=\Phi(u\mid x)+\Psi(u\mid x)$, so the overall survival probability matches the sum of sub-survival functions. This holds because the two proxy hazards sum to $-\frac{\partial}{\partial u}\log\left[\Phi(u\mid x)+\Psi(u\mid x)\right]$ and $\Phi(0\mid x)+\Psi(0\mid x)=1$. Therefore:

$$
\begin{aligned}
\Phi^{\mathrm{ind}}(t\mid x) &= \int_{t}^{\infty}\left(\frac{-\frac{\partial}{\partial u}\Phi(u\mid x)}{\Phi(u\mid x)+\Psi(u\mid x)}\right)\left(\Phi(u\mid x)+\Psi(u\mid x)\right)du\\
&= \int_{t}^{\infty}-\frac{\partial}{\partial u}\Phi(u\mid x)\,du\\
&= \Phi(t\mid x).
\end{aligned}
$$

Since $\Phi^{\mathrm{ind}}(t\mid x)=\Phi(t\mid x)$ (and by symmetry $\Psi^{\mathrm{ind}}(t\mid x)=\Psi(t\mid x)$), the independent model $S^{\mathrm{ind}}$ is indistinguishable from the true dependent model $S$ on the basis of observed data. ∎
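The proxy construction can be checked numerically on a simple covariate-free example. The sketch below assumes a shared gamma frailty model, which is our illustrative choice and not taken from the paper: $W\sim\mathrm{Gamma}(K,1)$ with $E\mid W$ and $C\mid W$ both exponential with rate $W$. This gives the dependent joint survival $S_{E,C}(e,c)=(1+e+c)^{-K}$, cause-specific hazards $K/(1+2t)$, and hence proxy marginals $S^{\mathrm{ind}}_{E}(t)=S^{\mathrm{ind}}_{C}(t)=(1+2t)^{-K/2}$. Both worlds share $\Phi(t)=\tfrac{1}{2}(1+2t)^{-K}$, while the latent event survival is $(1+t)^{-K}$ in the dependent world and $(1+2t)^{-K/2}$ in the proxy world.

```python
import numpy as np

K = 2.0  # gamma shape of the shared frailty (illustrative assumption)

def sample_dependent(n, rng):
    """Dependent world: E and C share a gamma frailty W, so E and C are dependent."""
    w = rng.gamma(shape=K, scale=1.0, size=n)
    e = rng.exponential(size=n) / w   # E | W ~ Exp(rate W)
    c = rng.exponential(size=n) / w   # C | W ~ Exp(rate W)
    return e, c

def sample_proxy(n, rng):
    """Independent proxy world: both marginals have survival (1 + 2t)^(-K/2)."""
    def draw():
        u = 1.0 - rng.random(n)                   # uniform on (0, 1]
        return (u ** (-2.0 / K) - 1.0) / 2.0      # inverse-transform sampling
    return draw(), draw()

def observe(e, c):
    """Map latent times to the observed pair (T, Delta)."""
    return np.minimum(e, c), (e <= c).astype(int)

rng = np.random.default_rng(0)
n = 200_000
e_dep, c_dep = sample_dependent(n, rng)
e_ind, c_ind = sample_proxy(n, rng)
t_dep, d_dep = observe(e_dep, c_dep)
t_ind, d_ind = observe(e_ind, c_ind)

for t0 in (0.25, 1.0):
    # Sub-survival function Phi(t0) = P(T > t0, Delta = 1): agrees across worlds.
    phi_dep = np.mean((t_dep > t0) & (d_dep == 1))
    phi_ind = np.mean((t_ind > t0) & (d_ind == 1))
    # Latent event survival P(E > t0): differs across worlds.
    s_dep = np.mean(e_dep > t0)
    s_ind = np.mean(e_ind > t0)
    print(f"t={t0}: Phi dep={phi_dep:.3f} ind={phi_ind:.3f} | "
          f"S_E dep={s_dep:.3f} ind={s_ind:.3f}")
```

Up to Monte Carlo error, the observed quantities coincide while the latent event survival does not, which is the content of Theorem B.2 in this special case.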
While Theorem [B.2](https://arxiv.org/html/2605.15488#A2.Thmtheorem2) states that we cannot distinguish *dependent* from *independent* censoring using data alone, the corollary below establishes that if we are willing to assume independence (either marginal or conditional), the latent event distribution becomes identifiable.
###### Corollary B.1 (Identifiability under Independence).
Suppose we restrict our attention to the class of models that satisfy conditionally independent censoring, $E\perp C\mid X$ (including $E\perp C$). Then the marginal survival function of the event, $S_{E\mid X}(t\mid x)$, is uniquely identifiable from the observed data distribution. That is, if two models $S$ and $S^{\mathrm{ind}}$ both satisfy conditional independence but have different event marginals ($S_{E\mid X}\neq S_{E\mid X}^{\mathrm{ind}}$), they must generate distinct observed data distributions.
###### Proof\.
We prove this by contradiction. Let $\lambda_{E\mid X}(t\mid x)$ denote the *cause-specific* hazard derived purely from the data $(X,T,\Delta)$:

$$
\lambda_{E\mid X}(t\mid x) = \lim_{dt\rightarrow 0}\frac{P(t\leq T<t+dt,\Delta=1\mid T\geq t,X=x)}{dt}.
$$

Let $\lambda(t\mid x)$ denote the *net* hazard of the event of interest:

$$
\lambda(t\mid x) = \lim_{dt\rightarrow 0}\frac{P(t\leq E<t+dt\mid E\geq t,X=x)}{dt}.
$$

Under the assumption of conditional independence ($E\perp C\mid X$), standard survival theory dictates that the net hazard is equal to the cause-specific hazard: $\lambda_{E\mid X}(t\mid x)=\lambda(t\mid x)$. Since the hazard function uniquely defines the survival function via $S_{E\mid X}(t\mid x)=\exp\left(-\int_{0}^{t}\lambda_{E\mid X}(u\mid x)\,du\right)$, the latent distribution $S_{E\mid X}$ is uniquely determined by the observed function $\lambda_{E\mid X}$.
Now, consider two models with different marginals, $S_{E\mid X}^{(1)}(t\mid x)\neq S_{E\mid X}^{(2)}(t\mid x)$. This inequality implies their net hazards must differ: $\lambda^{(1)}(t\mid x)\neq\lambda^{(2)}(t\mid x)$. By the equality derived above, their observed cause-specific hazards must also differ: $\lambda_{E\mid X}^{(1)}(t\mid x)\neq\lambda_{E\mid X}^{(2)}(t\mid x)$. Different hazards imply different observed data distributions. Thus, the model is identifiable. ∎
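As an illustration of the corollary, the sketch below recovers the latent event survival from observed data alone when censoring is independent. It uses the Kaplan-Meier product-limit estimator, a standard nonparametric estimator that is not part of the proof above; the exponential distributions and the helper names `kaplan_meier` and `km_at` are assumptions made for the example.

```python
import numpy as np

def kaplan_meier(t, delta):
    """Return sorted times and the Kaplan-Meier survival estimate just after each time.

    Assumes continuous times (no ties), which holds almost surely in the simulation below.
    """
    order = np.argsort(t)
    t_sorted, d_sorted = t[order], delta[order]
    at_risk = len(t) - np.arange(len(t))          # subjects still under observation
    surv = np.cumprod(1.0 - d_sorted / at_risk)   # product-limit over observed events
    return t_sorted, surv

def km_at(t_sorted, surv, t0):
    """Evaluate the Kaplan-Meier step function at time t0."""
    idx = np.searchsorted(t_sorted, t0, side="right") - 1
    return 1.0 if idx < 0 else float(surv[idx])

rng = np.random.default_rng(0)
n = 50_000
e = rng.exponential(scale=1.0, size=n)   # latent event times, true S_E(t) = exp(-t)
c = rng.exponential(scale=2.0, size=n)   # independent censoring times
t, delta = np.minimum(e, c), (e <= c).astype(int)

t_sorted, surv = kaplan_meier(t, delta)
for t0 in (0.5, 1.0, 2.0):
    print(f"t={t0}: KM={km_at(t_sorted, surv, t0):.3f}  true={np.exp(-t0):.3f}")
```

The estimator only ever sees `t` and `delta`, yet it tracks the latent $S_{E}$ because the cause-specific hazard coincides with the net hazard under independence.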
These properties imply that while SurvivalPFN can be robustly trained under assumptions of marginal or conditional independence, it cannot learn dependent censoring mechanisms from observed data alone\.
## Appendix C Posterior-Predictive Consistency
This appendix formalizes Proposition [3.1](https://arxiv.org/html/2605.15488#S3.Thmproposition1). The result concerns the Bayesian posterior predictive survival distribution (PPSD) defined in Equation [2.3](https://arxiv.org/html/2605.15488#S2.E3). It shows that, under an identifiable survival prior, the Bayesian PPSD is asymptotically consistent for the true conditional event-time survival function. We then state the corresponding idealized implication for SurvivalPFN when the transformer exactly amortizes this Bayesian target.
### C.1 Notation and regularity assumptions
Recall that a survival data-generating process (DGP) $P^{\theta}(X,E,C,T,\Delta)$ is indexed by $\theta\in\Theta$. We write $P^{\theta}_{\mathrm{obs}}$ for the marginal distribution over the observable random variables $(X,T,\Delta)$. As before, $\pi(\cdot)$ denotes the prior over $\Theta$ induced by the synthetic prior-data generator used to pretrain SurvivalPFN.
For each $\theta$, let

$$
S_{E\mid X,\Theta}(t\mid x,\theta) := \Pr(E>t\mid X=x,\Theta=\theta)
$$

denote the conditional event-time survival function. For an observed dataset $\mathcal{D}=\{(X_{i},T_{i},\Delta_{i})\}_{i=1}^{N}$, where $(X_{i},T_{i},\Delta_{i})\overset{\mathrm{i.i.d.}}{\sim}P^{\theta}_{\mathrm{obs}}$, and a query covariate vector $x^{\ast}$, the Bayesian PPSD is (repeated from Equation [2.3](https://arxiv.org/html/2605.15488#S2.E3))

$$
S_{E\mid X,\mathscr{D}}(t\mid x^{\ast},\mathcal{D}) = \int_{\Theta}S_{E\mid X,\Theta}(t\mid x^{\ast},\vartheta)\,f_{\Theta\mid\mathscr{D}}(\vartheta\mid\mathcal{D})\,d\vartheta.
$$
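The PPSD integral can be made concrete with a toy model in which $\Theta$ is a finite grid, so the integral becomes a weighted sum. The sketch below is a simplified illustration and not SurvivalPFN's procedure: it assumes a covariate-free exponential event model with rate $\theta$, a uniform prior on a hypothetical grid, and independent censoring whose law does not depend on $\theta$, so that the censoring terms drop out of the posterior.

```python
import numpy as np

def ppsd_exponential(t_obs, delta, theta_grid, prior, t_eval):
    """Posterior predictive survival S(t | D) for a covariate-free exponential model.

    Each theta on the grid is an event rate; `prior` is a probability vector on the grid.
    """
    # Right-censored log-likelihood: events contribute log f = log(theta) - theta * t,
    # censored points contribute log S = -theta * t.
    log_lik = delta.sum() * np.log(theta_grid) - theta_grid * t_obs.sum()
    log_post = np.log(prior) + log_lik
    post = np.exp(log_post - log_post.max())   # subtract the max for numerical stability
    post /= post.sum()                         # discrete posterior over theta
    surv_given_theta = np.exp(-np.outer(t_eval, theta_grid))  # S(t | theta)
    return surv_given_theta @ post             # average S(t | theta) over the posterior

rng = np.random.default_rng(0)
n = 2_000
e = rng.exponential(scale=1.0, size=n)    # true event rate theta* = 1
c = rng.exponential(scale=2.0, size=n)    # independent censoring
t_obs, delta = np.minimum(e, c), (e <= c).astype(int)

theta_grid = np.linspace(0.05, 5.0, 500)
prior = np.full(theta_grid.size, 1.0 / theta_grid.size)   # uniform prior on the grid
t_eval = np.array([0.0, 0.5, 1.0, 2.0])
ppsd = ppsd_exponential(t_obs, delta, theta_grid, prior, t_eval)
print("PPSD:", np.round(ppsd, 3))
print("true:", np.round(np.exp(-t_eval), 3))
```

As the dataset grows, the posterior concentrates near $\theta^{\ast}$ and the PPSD approaches the true survival curve, which is the behavior the consistency result in this appendix formalizes for identifiable priors.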
###### Assumption C\.1\(Regularity\)\.
We assume the following standard regularity conditions\.
1. $(\Theta, \mathcal{B}_{\Theta})$ is a standard Borel parameter space\.
2. The maps $\theta \mapsto P^{\theta}$ and $\theta \mapsto P^{\theta}_{\mathrm{obs}}$ are measurable\.
3. The image set $\{P^{\theta}_{\mathrm{obs}} : \theta \in \Theta\}$ is a Borel subset of the space of probability measures over $(X, T, \Delta)$\.
4. For each time $t$ in the evaluation region, there exists a version of $S_{E \mid X, \Theta}(t \mid x^{\ast}, \theta)$ that is jointly measurable in $(x^{\ast}, \theta)$\.
These assumptions are technical rather than substantive\. They ensure that priors, posteriors, conditional expectations, and the quotient construction below are well\-defined\. They are satisfied by the usual finite\-dimensional parameter spaces and measurable simulators used in statistical and machine learning models\.
### C\.2 Observed\-law equivalence and survival identifiability
The full latent law $P^{\theta}(X, E, C, T, \Delta)$ is generally not identifiable from right\-censored observations\. The data can identify only the observed law $P^{\theta}_{\mathrm{obs}}$ over $(X, T, \Delta)$\. We therefore group DGP parameters by observational equivalence:
$$\theta_{1} \ \sim\ \theta_{2} \qquad \Longleftrightarrow \qquad P^{\theta_{1}}_{\mathrm{obs}} \ =\ P^{\theta_{2}}_{\mathrm{obs}}.$$
Let
$$\mathcal{Q} \ :=\ \Theta / \!\sim$$
denote the observational quotient space \(the set of equivalence classes\) induced by $\sim$, and let $[\theta] \in \mathcal{Q}$ denote the equivalence class of $\theta$\. Each class $[\theta]$ contains all latent survival DGPs that are indistinguishable from one another on the basis of the observed law\.
###### Definition C\.1\(Survival\-identifiable prior\)\.
Fix a time $t$ in the evaluation region\. We say that the prior $\pi$ is *survival\-identifiable at time $t$* if there exists a measurable map
$$F_{t} : \mathcal{X} \times \mathcal{Q} \ \to\ [0, 1]$$
such that, for $\pi$\-almost every $\theta$ and $P_{X}^{\pi}$\-almost every $x$ \(that is, except on sets of $\theta$ or $x$ with probability $0$\),
$$S_{E \mid X, \Theta}(t \mid x, \theta) \ =\ F_{t}(x, [\theta]).$$
Equivalently, within the support of the prior, any two DGPs that induce the same observational distribution over $(X, T, \Delta)$ must also induce the same conditional event\-time survival function at time $t$\.
Footnote 1: In the following, we simply write “every $\theta$” and “every $x$” in place of the almost\-every qualifiers\.
The conditional independent censoring and positivity assumptions discussed in Section [2](https://arxiv.org/html/2605.15488#S2) provide sufficient conditions for this definition: under those assumptions, $S_{E \mid X}(t \mid x)$ is a functional of the observed law $P(X, T, \Delta)$ on the identifiable time region\.
### C\.3 Consistency of the Bayesian PPSD
###### Proposition C\.1\(Formal consistency\)\.
Fix a time $t$ in the evaluation region\. Under Assumption [C\.1](https://arxiv.org/html/2605.15488#A3.Thmassumption1), the prior $\pi$ is survival\-identifiable at time $t$ if and only if there exist sets $\mathcal{X}_{0} \subseteq \mathcal{X}$ and $\Theta_{0} \subseteq \Theta$ with
$$P_{X}^{\pi}(\mathcal{X}_{0}) \ =\ 1, \qquad \pi(\Theta_{0}) \ =\ 1,$$
such that, for every $x^{\ast} \in \mathcal{X}_{0}$ and every $\theta^{\ast} \in \Theta_{0}$,
$$S_{E \mid X, \mathscr{D}}(t \mid x^{\ast}, \mathcal{D}) \ \xrightarrow[N \to \infty]{\mathrm{a.s.}}\ S_{E \mid X, \Theta}(t \mid x^{\ast}, \theta^{\ast}). \tag{C.1}$$
For any finite or countable evaluation grid $\mathcal{G}$, the same result holds simultaneously for all $t \in \mathcal{G}$ by intersecting the corresponding full\-measure sets\.
###### Proof\.
We first define the quotient\-level target\. For fixed $t$ and $x^{\ast}$, we define
$$M_{t}(x^{\ast}, [\theta]) \ :=\ \mathbb{E}_{\pi}\left[ S_{E \mid X, \Theta}(t \mid x^{\ast}, \vartheta) \mid [\vartheta] = [\theta] \right].$$
This is the prior\-average survival probability among all DGPs that induce the same observed law as $\theta$\. Since $0 \leq S_{E \mid X, \Theta}(t \mid x^{\ast}, \vartheta) \leq 1$, the integrand is bounded and this conditional expectation is well\-defined\.
By construction, two different elements of $\mathcal{Q}$ correspond to two different observational distributions\. Thus, the quotient parameter $[\theta]$ is identifiable from the observed distribution\. Assumption [C\.1](https://arxiv.org/html/2605.15488#A3.Thmassumption1) ensures that the quotient model is measurable and that Doob’s consistency theorem \[[14](https://arxiv.org/html/2605.15488#bib.bib162)\] applies to posterior expectations of integrable functions on this quotient space\.
Therefore, for every true parameter $\theta^{\ast}$ \(except on a set of $\pi$\-probability $0$\), if the observations in $\mathcal{D}$ are drawn i\.i\.d\. from $P^{\theta^{\ast}}_{\mathrm{obs}}$, then
$$\mathbb{E}_{\pi}\left[ M_{t}(x^{\ast}, [\theta]) \mid \mathcal{D} \right] \ \xrightarrow[N \to \infty]{\mathrm{a.s.}}\ M_{t}(x^{\ast}, [\theta^{\ast}]). \tag{C.2}$$
We now show that the left\-hand side of Equation [C\.2](https://arxiv.org/html/2605.15488#A3.E2) is the Bayesian PPSD\. Since the observed data distribution depends on $\theta$ only through the equivalence class $[\theta]$, we have the conditional independence
$$\mathcal{D} \perp \theta \mid [\theta].$$
Hence,
$$\begin{aligned}
\mathbb{E}_{\pi}\left[ M_{t}(x^{\ast}, [\theta]) \mid \mathcal{D} \right]
&=\ \mathbb{E}_{\pi}\left[ \mathbb{E}_{\pi}\left[ S_{E \mid X, \Theta}(t \mid x^{\ast}, \theta) \mid [\theta] \right] \mid \mathcal{D} \right] \\
&=\ \mathbb{E}_{\pi}\left[ \mathbb{E}_{\pi}\left[ S_{E \mid X, \Theta}(t \mid x^{\ast}, \theta) \mid [\theta], \mathcal{D} \right] \mid \mathcal{D} \right] \\
&=\ \mathbb{E}_{\pi}\left[ S_{E \mid X, \Theta}(t \mid x^{\ast}, \theta) \mid \mathcal{D} \right] \\
&=\ \int_{\Theta} S_{E \mid X, \Theta}(t \mid x^{\ast}, \vartheta)\, f_{\Theta \mid \mathscr{D}}(\vartheta \mid \mathcal{D})\, d\vartheta \\
&=\ S_{E \mid X, \mathscr{D}}(t \mid x^{\ast}, \mathcal{D}).
\end{aligned}$$
Combining this equality with Equation [C\.2](https://arxiv.org/html/2605.15488#A3.E2) gives
$$S_{E \mid X, \mathscr{D}}(t \mid x^{\ast}, \mathcal{D}) \ \xrightarrow[N \to \infty]{\mathrm{a.s.}}\ M_{t}(x^{\ast}, [\theta^{\ast}]). \tag{C.3}$$
Without identifiability, this is the strongest possible conclusion: the PPSD converges to the prior\-average survival function over DGPs that are observationally equivalent to the truth\.
Finally, suppose that $\pi$ is survival\-identifiable at time $t$\. Then, by Definition [C\.1](https://arxiv.org/html/2605.15488#A3.Thmdefinition1), all DGPs in the same observational equivalence class have the same survival probability at $(t, x^{\ast})$\. Therefore,
$$\begin{aligned}
M_{t}(x^{\ast}, [\theta^{\ast}])
&=\ \mathbb{E}_{\pi}\left[ S_{E \mid X, \Theta}(t \mid x^{\ast}, \vartheta) \mid [\vartheta] = [\theta^{\ast}] \right] \\
&=\ S_{E \mid X, \Theta}(t \mid x^{\ast}, \theta^{\ast}).
\end{aligned}$$
Substituting this into Equation [C\.3](https://arxiv.org/html/2605.15488#A3.E3) proves Equation [C\.1](https://arxiv.org/html/2605.15488#A3.E1)\. This establishes sufficiency\.
For necessity, suppose that the Bayesian PPSD is consistent in the sense of Equation [C\.1](https://arxiv.org/html/2605.15488#A3.E1) for every $\theta^{\ast}$\. From Equation [C\.3](https://arxiv.org/html/2605.15488#A3.E3), the same sequence also converges almost surely to $M_{t}(x^{\ast}, [\theta^{\ast}])$\. By uniqueness of almost\-sure limits,
$$S_{E \mid X, \Theta}(t \mid x^{\ast}, \theta^{\ast}) \ =\ M_{t}(x^{\ast}, [\theta^{\ast}])$$
for every $\theta^{\ast}$\. Since the right\-hand side depends on $\theta^{\ast}$ only through its observational equivalence class $[\theta^{\ast}]$, the survival functional $S_{E \mid X, \Theta}(t \mid x^{\ast}, \theta^{\ast})$ also depends only on $[\theta^{\ast}]$\. Thus, Definition [C\.1](https://arxiv.org/html/2605.15488#A3.Thmdefinition1) holds with $F_{t}(x^{\ast}, [\theta]) = M_{t}(x^{\ast}, [\theta])$, up to arbitrary definition on null sets\. Hence the prior is survival\-identifiable at time $t$\. ∎
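As a numerical illustration of the sufficiency direction, the following minimal Python sketch uses a hypothetical toy prior that is not taken from the paper: exponential event times whose rate lies on a finite grid with a uniform prior, and independent exponential censoring with a fixed rate\. Under these assumptions the prior is survival\-identifiable, the posterior over the grid is available in closed form, and the PPSD can be compared with the true survival probability as $N$ grows\. All function names and numerical settings are illustrative\.

```python
import math
import random


def ppsd_exponential(times, events, t, rate_grid):
    """PPSD P(E > t | D) under a uniform prior over the rates in rate_grid."""
    n_events = sum(events)
    exposure = sum(times)
    # Log-likelihood of right-censored exponential data. The censoring density
    # does not depend on the event rate, so it cancels in the posterior.
    log_post = [n_events * math.log(r) - r * exposure for r in rate_grid]
    shift = max(log_post)  # stabilise the exponentials
    weights = [math.exp(lp - shift) for lp in log_post]
    total = sum(weights)
    # Posterior average of the per-rate survival function exp(-r * t).
    return sum(w / total * math.exp(-r * t) for w, r in zip(weights, rate_grid))


def simulate(n, true_rate, censor_rate, rng):
    """Draw n observations (T, delta) with independent exponential censoring."""
    times, events = [], []
    for _ in range(n):
        e = rng.expovariate(true_rate)
        c = rng.expovariate(censor_rate)
        times.append(min(e, c))
        events.append(1 if e <= c else 0)
    return times, events


rng = random.Random(0)
rate_grid = [0.25 * k for k in range(1, 13)]  # hypothetical support of the prior
true_rate, t_eval = 1.0, 1.0
truth = math.exp(-true_rate * t_eval)

errors = {}
for n in (10, 100, 10000):
    times, events = simulate(n, true_rate, 0.5, rng)
    errors[n] = abs(ppsd_exponential(times, events, t_eval, rate_grid) - truth)
    print(n, errors[n])
```

In this toy setting the absolute error is expected to shrink as $N$ increases, because the posterior concentrates on the grid point equal to the true rate\. The sketch says nothing about the finite\-capacity transformer, which is the subject of the next subsection\.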
### C\.4 Implication for SurvivalPFN
Proposition [C\.1](https://arxiv.org/html/2605.15488#A3.Thmproposition1) characterizes the Bayesian posterior predictive target\. SurvivalPFN is a finite neural network trained to approximate this target, so the proposition does not by itself prove finite\-sample or finite\-capacity consistency of the trained transformer\. It does, however, imply consistency in the idealized exact\-amortization limit\.
###### Corollary C\.1\(Consistency of SurvivalPFN\)\.
Assume the conditions of Proposition [C\.1](https://arxiv.org/html/2605.15488#A3.Thmproposition1)\. Suppose that, for $\widetilde{\delta}^{\ast} = 1$, SurvivalPFN exactly amortizes the Bayesian posterior predictive event\-time distribution, so that its predicted survival curve satisfies
$$\widehat{S}_{\omega}(t \mid x^{\ast}, \mathcal{D}) \ =\ S_{E \mid X, \mathscr{D}}(t \mid x^{\ast}, \mathcal{D}).$$
Then, for every true DGP parameter $\theta^{\ast}$ and every $x^{\ast}$,
$$\widehat{S}_{\omega}(t \mid x^{\ast}, \mathcal{D}) \ \xrightarrow[N \to \infty]{\mathrm{a.s.}}\ S_{E \mid X, \Theta}(t \mid x^{\ast}, \theta^{\ast}).$$
More generally, the same conclusion holds if the amortization error vanishes:
$$\left| \widehat{S}_{\omega}(t \mid x^{\ast}, \mathcal{D}) - S_{E \mid X, \mathscr{D}}(t \mid x^{\ast}, \mathcal{D}) \right| \ \xrightarrow[N \to \infty]{}\ 0.$$
###### Proof\.
The exact\-amortization case follows immediately by substituting $\widehat{S}_{\omega}(t \mid x^{\ast}, \mathcal{D}) = S_{E \mid X, \mathscr{D}}(t \mid x^{\ast}, \mathcal{D})$ into Proposition [C\.1](https://arxiv.org/html/2605.15488#A3.Thmproposition1)\. The vanishing\-error case follows from the triangle inequality:
$$\begin{aligned}
&\left| \widehat{S}_{\omega}(t \mid x^{\ast}, \mathcal{D}) - S_{E \mid X, \Theta}(t \mid x^{\ast}, \theta^{\ast}) \right| \\
&\qquad \leq\ \left| \widehat{S}_{\omega}(t \mid x^{\ast}, \mathcal{D}) - S_{E \mid X, \mathscr{D}}(t \mid x^{\ast}, \mathcal{D}) \right| + \left| S_{E \mid X, \mathscr{D}}(t \mid x^{\ast}, \mathcal{D}) - S_{E \mid X, \Theta}(t \mid x^{\ast}, \theta^{\ast}) \right|.
\end{aligned}$$
The first term vanishes by assumption, and the second term vanishes almost surely by Proposition [C\.1](https://arxiv.org/html/2605.15488#A3.Thmproposition1)\. ∎
## Appendix D Additional SurvivalPFN Details
### D\.1 Synthetic Prior and DGP Simulation
SurvivalPFN is pretrained on synthetic right\-censored survival datasets sampled from a prior $\pi(\cdot)$ over identifiable survival data\-generating processes\. A draw $\theta \sim \pi(\cdot)$ specifies all random choices needed to generate one synthetic task, including the covariate distribution, event\-time mechanism, censoring\-time mechanism, censoring type, and target censoring rate\.
The simulator returns both the observed survival dataset
$$\mathcal{D}^{tr}_{\theta} \ =\ \{(x_{i}, t_{i}, \delta_{i})_{\theta}\}_{i=1}^{N},$$
and the corresponding latent event and censoring times $\{(e_{i}, c_{i})_{\theta}\}_{i=1}^{N}$, which are used only for constructing supervised query targets during prior\-data training\.
#### Random Tabular Generators\.
All survival priors are built from a generic tabular generator\. We write $G$ for a random table generator that can sample either an unconditional table or a conditional table\. Concretely, a call to $G$ first samples generator\-specific latent parameters $\zeta$, and then produces either
$$X \ \sim\ G_{\zeta}(\cdot) \qquad \text{or} \qquad Y \mid X \ \sim\ G_{\zeta}(\cdot \mid X),$$
depending on whether an unconditional or a conditional table is required\.
This abstraction also lets the same survival\-prior design use different tabular generators, including PFN\-style random multilayer perceptrons, random structural causal models, tree\-based generators, or mixtures of these sources\.
#### Censoring Mechanisms\.
We include several censoring mechanisms to cover common right\-censoring regimes while retaining identifiability\.
Uniform censoring\. Event times are generated from the event\-time prior, while censoring times are sampled from a uniform distribution\. Depending on the time\-generation family, the support is either based on the generated event\-time range or on a sampled global time horizon:
$$C_{i} \ \sim\ \operatorname{Unif}\!\left( \min_{j} E_{j},\, \max_{j} E_{j} \right), \qquad \text{or} \qquad C_{i} \ \sim\ \operatorname{Unif}(0, t_{\max}).$$
Random censoring\. Event times are generated conditionally on $X$, while censoring times are generated from a separate unconditional tabular mechanism:
$$E \mid X \ \sim\ \Pr(E \mid X), \qquad C \ \sim\ \Pr(C).$$
This produces censoring distributions that are more diverse than simple uniform censoring while remaining independent of the event\-time noise conditional on the task\.
Administrative censoring\. Administrative censoring simulates staggered study entry with a common study end date\. The simulator samples entry times $A_{i}$, chooses a fixed administrative end time $a_{\ast}$, and sets
$$C_{i} \ =\ a_{\ast} - A_{i}.$$
This captures a common survival\-analysis setting in which subjects enter at different calendar times but are all censored at the same study termination date\.
Independent censoring\. For independent \(or covariate\-dependent\) censoring, both event and censoring times are generated conditional on $X$, but from independent blocks:
$$E \perp C \mid X, \qquad \Pr(E, C \mid X) \ =\ \Pr(E \mid X)\, \Pr(C \mid X).$$
This mechanism is central to the identifiable\-prior construction: it allows censoring to depend on covariates while preserving the standard conditional independent censoring assumption used for nonparametric identification\.
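The four mechanisms can be sketched in a few lines of Python\. This is a minimal illustration under stated assumptions, not the simulator used for pretraining: the exponential event\-time model, the Weibull stand\-in for the unconditional tabular mechanism, the uniform entry\-time law, and all function names and constants are hypothetical choices made only so that the example runs\.

```python
import math
import random


def observe(event_times, censor_times):
    """Right-censor latent times: T = min(E, C), delta = 1 if the event is seen."""
    times = [min(e, c) for e, c in zip(event_times, censor_times)]
    deltas = [1 if e <= c else 0 for e, c in zip(event_times, censor_times)]
    return times, deltas


def uniform_censoring(event_times, rng):
    """C_i ~ Unif(min_j E_j, max_j E_j)."""
    lo, hi = min(event_times), max(event_times)
    return [rng.uniform(lo, hi) for _ in event_times]


def random_censoring(n, rng):
    """C ~ Pr(C), drawn without looking at X (a Weibull stand-in, assumed)."""
    return [rng.weibullvariate(2.0, 1.5) for _ in range(n)]


def administrative_censoring(n, study_end, rng):
    """C_i = a_* - A_i with entry times A_i ~ Unif(0, a_*) (assumed entry law)."""
    return [study_end - rng.uniform(0.0, study_end) for _ in range(n)]


def conditional_independent_censoring(xs, rng):
    """C depends on x but uses its own noise, so E is independent of C given X."""
    return [rng.expovariate(math.exp(-0.3 * x)) for x in xs]


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]
# Toy event-time model E | X: exponential with a covariate-dependent rate.
events = [rng.expovariate(math.exp(0.5 * x)) for x in xs]

censoring = {
    "uniform": uniform_censoring(events, rng),
    "random": random_censoring(len(xs), rng),
    "administrative": administrative_censoring(len(xs), 3.0, rng),
    "conditional": conditional_independent_censoring(xs, rng),
}
datasets = {name: observe(events, c) for name, c in censoring.items()}
censoring_rates = {name: 1.0 - sum(d) / len(d) for name, (_, d) in datasets.items()}
print(censoring_rates)
```

In every branch the censoring noise is drawn separately from the event\-time noise, which is what keeps the generated tasks inside the conditional independent censoring regime\.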
#### Prior Family 1: Naive Survival Prior\.
The simplest prior treats generated table values as raw event and censoring times\. Let $G_{\zeta}$ denote a conditional event/censoring table generator\. For the event\-time branch, we sample
$$E \ =\ G_{\zeta}(X; 1),$$
where $G_{\zeta}(X; 1)$ denotes one conditional output column given $X$\. The censoring branch is then chosen according to the censoring type: uniform censoring samples $C$ from a uniform distribution; random \(tabular\) censoring samples $C$ from an unconditional table generator; administrative censoring constructs $C$ from entry times; and the conditional independent branch samples separate event and censoring columns conditional on $X$\. This prior is intentionally simple and flexible: it exposes the model to irregular, nonparametric time patterns without imposing an explicit survival\-time family\.
#### Prior Family 2: Survival\-Distribution Prior\.
The second prior generates smooth distributions using random monotone maps\. For each dataset, the simulator samples a time horizon $t_{\max} > 0$, a number of knots $K$, and unconstrained coefficients
$$c \ =\ (c_{1}, \ldots, c_{K}) \ \in\ \mathbb{R}^{K}.$$
These coefficients are converted into positive increments
$$\Delta_{j} \ =\ \frac{\exp(c_{j})}{\sum_{\ell=1}^{K} \exp(c_{\ell})}, \qquad j = 1, \ldots, K,$$
which define monotone control points
$$b_{0} \ =\ 0, \qquad b_{j} \ =\ \sum_{\ell=1}^{j} \Delta_{\ell}, \qquad j = 1, \ldots, K.$$
The resulting monotone Bernstein map is
$$f_{c}(u) \ =\ \sum_{j=0}^{K} b_{j} \binom{K}{j} u^{j} (1 - u)^{K - j}, \qquad u \in [0, 1].$$
A raw time is then sampled by pushing uniform noise through this monotone map:
$$\tau \ =\ t_{\max} f_{c}(U), \qquad U \ \sim\ \operatorname{Unif}(0, 1).$$
This construction can be viewed as a random smooth quantile\-like function\. It gives a flexible family of event\-time and censoring\-time distributions without committing to a fixed parametric hazard form\. Under conditional independent censoring, event and censoring coefficient blocks are generated independently given $X$:
$$E_{i} \ =\ t_{\max} f_{c_{i}^{E}}(U_{i}^{E}), \qquad C_{i} \ =\ t_{\max} f_{c_{i}^{C}}(U_{i}^{C}),$$
with independent $U_{i}^{E}, U_{i}^{C} \sim \operatorname{Unif}(0, 1)$ and independent coefficient blocks conditional on $x_{i}$\.
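The construction above can be sketched directly from the formulas\. The code below is a minimal illustration, assuming Gaussian coefficients and $K = 6$ purely for the demo \(the sampling distributions of $c$ and $K$ are not specified here\); the function names are hypothetical\.

```python
import math
import random


def control_points(coeffs):
    """Softmax increments accumulated into b_0 = 0 <= b_1 <= ... <= b_K = 1."""
    shift = max(coeffs)  # subtract the max for numerical stability
    exps = [math.exp(c - shift) for c in coeffs]
    total = sum(exps)
    points = [0.0]
    for e in exps:
        points.append(points[-1] + e / total)
    return points


def bernstein_map(u, points):
    """Evaluate f_c(u) = sum_j b_j * C(K, j) * u^j * (1 - u)^(K - j)."""
    k = len(points) - 1
    return sum(
        b * math.comb(k, j) * u**j * (1.0 - u) ** (k - j)
        for j, b in enumerate(points)
    )


def sample_time(coeffs, t_max, rng):
    """Push uniform noise through the monotone map and rescale by t_max."""
    return t_max * bernstein_map(rng.random(), control_points(coeffs))


rng = random.Random(0)
coeffs = [rng.gauss(0.0, 1.0) for _ in range(6)]  # K = 6, illustrative only
points = control_points(coeffs)
grid = [i / 100 for i in range(101)]
values = [bernstein_map(u, points) for u in grid]
samples = [sample_time(coeffs, 10.0, rng) for _ in range(1000)]
```

Because the control points are nondecreasing with $b_{0} = 0$ and $b_{K} = 1$, the map satisfies $f_{c}(0) = 0$ and $f_{c}(1) = 1$ and is nondecreasing in between, so sampled times fall in $[0, t_{\max}]$\.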
#### Prior Family 3: Mixture Prior\.
The third prior samples event and censoring times from mixtures of distributions\. In our implementation, the component family is sampled from Weibull and lognormal distributions, and the number of components $K_{\mathrm{mix}}$ is sampled from a finite candidate set\. For each row $i$, a table generator emits hidden values
$$h_{i} \ =\ (a_{i1}, b_{i1}, r_{i1}, \ldots, a_{iK_{\mathrm{mix}}}, b_{iK_{\mathrm{mix}}}, r_{iK_{\mathrm{mix}}}).$$
The logits $r_{ij}$ define mixture weights
$$\pi_{ij} \ =\ \frac{\exp(r_{ij})}{\sum_{\ell=1}^{K_{\mathrm{mix}}} \exp(r_{i\ell})}, \qquad \sum_{j=1}^{K_{\mathrm{mix}}} \pi_{ij} \ =\ 1,$$
and a component index is sampled as
$$Z_{i} \ \sim\ \operatorname{Categorical}(\pi_{i1}, \ldots, \pi_{iK_{\mathrm{mix}}}).$$
For a Weibull component, positive shape and scale parameters are obtained by
$$\kappa_{ij} \ =\ \operatorname{softplus}(a_{ij}) + 0.1, \qquad \lambda_{ij} \ =\ \operatorname{softplus}(b_{ij}) + 0.1,$$
and the sampled time is
$$\tau_{i} \ =\ \lambda_{iZ_{i}} \left[ -\log(1 - U_{i}) \right]^{1 / \kappa_{iZ_{i}}}, \qquad U_{i} \ \sim\ \operatorname{Unif}(0, 1).$$
For a lognormal component,
$$\mu_{ij} \ =\ \operatorname{softplus}(a_{ij}), \qquad \sigma_{ij} \ =\ \operatorname{softplus}(b_{ij}),$$
and
$$\log \tau_{i} \ =\ \mu_{iZ_{i}} + \sigma_{iZ_{i}} \varepsilon_{i}, \qquad \varepsilon_{i} \ \sim\ \mathcal{N}(0, 1).$$
This mixture prior provides a complementary source of smooth positive\-time distributions with heavy tails and multimodality\.
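A minimal Python sketch of the per\-row sampling step follows\. It assumes that the component family is passed in as a string and that the hidden values $(a_{ij}, b_{ij}, r_{ij})$ are already available; in the actual prior they come from a table generator, which is replaced here by Gaussian draws\. Function names are hypothetical\.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mixture_weights(logits):
    """Softmax of the logits r_ij for one row."""
    shift = max(logits)
    exps = [math.exp(r - shift) for r in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_mixture_time(hidden, family, rng):
    """Draw one time from hidden values [(a_j, b_j, r_j), ...] of a single row."""
    weights = mixture_weights([r for _, _, r in hidden])
    z = rng.choices(range(len(hidden)), weights=weights)[0]
    a, b, _ = hidden[z]
    if family == "weibull":
        shape = softplus(a) + 0.1
        scale = softplus(b) + 0.1
        # Inverse-CDF sampling: tau = scale * (-log(1 - U))^(1 / shape).
        return scale * (-math.log(1.0 - rng.random())) ** (1.0 / shape)
    if family == "lognormal":
        mu, sigma = softplus(a), softplus(b)
        return math.exp(mu + sigma * rng.gauss(0.0, 1.0))
    raise ValueError(f"unknown family: {family}")


rng = random.Random(0)
# Stand-in for the table generator output of one row with K_mix = 3.
hidden = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(3)]
weibull_times = [sample_mixture_time(hidden, "weibull", rng) for _ in range(1000)]
lognormal_times = [sample_mixture_time(hidden, "lognormal", rng) for _ in range(1000)]
```

The $+0.1$ offset in the Weibull branch keeps shape and scale bounded away from zero, which avoids degenerate components when the raw hidden values are very negative\.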
#### Prior Family 4: Kitchen\-Sink Meta Prior\.
Finally, SurvivalPFN can use a meta\-prior that mixes multiple complete survival priors\. Let $P_{1}, \ldots, P_{M}$ denote child prior generators and let $w_{1}, \ldots, w_{M}$ be nonnegative mixture weights with $\sum_{m} w_{m} = 1$\. The kitchen\-sink prior samples
$$M^{\ast} \ \sim\ \operatorname{Categorical}(w_{1}, \ldots, w_{M}), \qquad \mathcal{D}^{tr}_{\theta} \ \sim\ P_{M^{\ast}}.$$
Equivalently, the induced prior over synthetic datasets is
$$P_{\mathrm{sink}}(\mathcal{D}) \ =\ \sum_{m=1}^{M} w_{m} P_{m}(\mathcal{D}).$$
This mixture increases prior diversity by combining direct table\-output mechanisms, smooth monotone distributional mechanisms, and positive\-time mixture mechanisms\. In the current default configuration, the kitchen\-sink prior places most mass on the direct table\-output and monotone survival\-distribution priors, while the exact mixture weights remain configurable\.
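The two\-stage sampling scheme is short enough to sketch\. In the example below the three child priors are toy stand\-ins and the weights `[0.45, 0.45, 0.10]` are hypothetical values chosen only to mimic “most mass on the first two children”; the actual default weights are not stated here\.

```python
import random


def sample_kitchen_sink(child_priors, weights, rng):
    """Pick M* ~ Categorical(w), then draw one dataset from child prior P_{M*}."""
    m_star = rng.choices(range(len(child_priors)), weights=weights)[0]
    return m_star, child_priors[m_star](rng)


# Toy child priors: each returns a tiny list of positive "times".
children = [
    lambda rng: [rng.uniform(0.0, 1.0) for _ in range(8)],
    lambda rng: [rng.betavariate(2.0, 2.0) for _ in range(8)],
    lambda rng: [rng.weibullvariate(1.0, 1.5) for _ in range(8)],
]
weights = [0.45, 0.45, 0.10]  # hypothetical mixture weights

rng = random.Random(0)
draws = [sample_kitchen_sink(children, weights, rng)[0] for _ in range(10000)]
frequencies = [draws.count(m) / len(draws) for m in range(len(children))]
print(frequencies)
```

Over many draws the empirical frequency of each child approaches its weight $w_{m}$, which is the sampling counterpart of the mixture formula for $P_{\mathrm{sink}}$\.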
#### Diagnostics for Synthetic DGPs\.
We further examine the synthetic DGPs induced by each prior family in Figures [9](https://arxiv.org/html/2605.15488#A4.F9) and [10](https://arxiv.org/html/2605.15488#A4.F10)\. These diagnostics are designed to check the two desiderata of the prior: it should remain close to the identifiable independent\-censoring regime, while still covering a broad range of survival\-data regimes\. For each prior family, we sample 500 synthetic datasets, each with 1,024 observations, and compute dataset\-level and curve\-level summaries\.
Figure 9: Prior\-quality diagnostics across four synthetic prior families, with 500 sampled datasets per family\. Panels: \(a\) Naive prior, \(b\) Survival\-distribution prior, \(c\) Mixture prior, \(d\) Kitchen\-sink prior\. Each panel summarizes the induced dependence between event and censoring times, censoring\-rate coverage, time\-scale dispersion, and conditional event\-time uncertainty\.
Figure 10: Empirical distributions across 500 synthetic datasets\. Panels: \(a\) Naive prior, \(b\) Survival\-distribution prior, \(c\) Mixture\-model prior, \(d\) Kitchen\-sink prior\. Each panel shows the induced marginal event\-time survival curve $P(E > t)$, censoring\-time survival curve $P(C > t)$, and observed Kaplan\-Meier curve $\widehat{S}_{\mathrm{KM}}(t)$\. Solid lines denote the pointwise median curve across generated datasets\. Dark shaded bands denote the interquartile range \(25th\-75th percentiles\), and light shaded bands denote the 10th\-90th percentile range\.
Figure [9](https://arxiv.org/html/2605.15488#A4.F9) reports scalar diagnostics for each generated dataset\. The left subpanel estimates the conditional mutual information $I(E; C \mid X)$ between the latent event time $E$ and censoring time $C$ after conditioning on covariates $X$\. Since our simulator is constructed under conditional independence, values concentrated near zero provide an empirical sanity check that the generated censoring process does not introduce substantial residual event\-censoring dependence beyond $X$\. The right subpanel summarizes dataset diversity: each point is one generated dataset, with the x\-axis showing the censoring rate, the y\-axis showing the observed\-time dispersion $\log_{10}(\mathrm{CV}(T))$, where $\mathrm{CV}(T) = \mathrm{std}(T) / |\mathrm{mean}(T)|$ is the coefficient of variation of the observed times, and the color showing the conditional observed\-time entropy $\mathbb{E}_{X}[H(T \mid X)]$\. Thus, this panel visualizes how broadly each prior covers censoring rates, relative time\-scale variation, and residual outcome uncertainty after conditioning on covariates\.
Several trends are apparent from these plots\. First, across all four prior families, the estimated conditional mutual information $I(E; C \mid X)$ is highly concentrated near zero, suggesting that the generated datasets largely remain within the intended conditional independent censoring regime\. This provides an empirical sanity check that the prior does not typically generate strongly informative censoring structures\. Second, the scatter plots show that the priors cover a broad range of right\-censored survival regimes\. The generated datasets span nearly the full range of censoring rates, from lightly censored to heavily censored settings\. They also cover different observed\-time dispersion regimes, and different levels of residual conditional uncertainty, as measured by $\mathbb{E}_{X}[H(T \mid X)]$\.
Figure [10](https://arxiv.org/html/2605.15488#A4.F10) complements these scalar summaries with curve\-level diagnostics\. For each generated dataset, we compute the empirical latent event\-time survival curve $P(E > t)$, the empirical latent censoring\-time survival curve $P(C > t)$, and the Kaplan\-Meier estimate $\widehat{S}_{\mathrm{KM}}(t)$ from the observed pairs $(T, \Delta)$\. The solid line denotes the pointwise median curve across generated datasets, while the dark and light bands denote the pointwise 25th\-75th and 10th\-90th percentile ranges, respectively\. These curves vary substantially across generated datasets: the priors induce a wide range of event\-time, censoring\-time, and observed survival shapes, including different decay rates, time scales, tail behaviors, and censoring\-distorted Kaplan\-Meier patterns\. Together, Figures [9](https://arxiv.org/html/2605.15488#A4.F9) and [10](https://arxiv.org/html/2605.15488#A4.F10) show that the proposed prior families generate diverse survival tasks while maintaining the identifiable right\-censoring structure required by our theoretical framework\.
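For readers who want to reproduce the simpler diagnostics, the sketch below computes the censoring rate, $\log_{10}(\mathrm{CV}(T))$, and a Kaplan\-Meier curve for one dataset\. It is a minimal stdlib\-only illustration: the use of the population standard deviation is an assumption \(the text does not say which estimator is used\), the function names are hypothetical, and the mutual\-information and entropy estimators are not covered\.

```python
import math
import statistics


def dataset_summaries(times, deltas):
    """Return (censoring rate, log10 of the coefficient of variation of T)."""
    censoring_rate = 1.0 - sum(deltas) / len(deltas)
    # Population standard deviation is an assumption made for this sketch.
    cv = statistics.pstdev(times) / abs(statistics.fmean(times))
    return censoring_rate, math.log10(cv)


def kaplan_meier(times, deltas):
    """Kaplan-Meier estimate as a list of (event time, survival probability)."""
    order = sorted(zip(times, deltas))
    n_at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = removed = 0
        # Group all observations tied at time t.
        while i < len(order) and order[i][0] == t:
            deaths += order[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


times = [1.0, 2.0, 3.0, 4.0]
deltas = [1, 0, 1, 1]  # the observation at t = 2 is censored
rate, log10_cv = dataset_summaries(times, deltas)
curve = kaplan_meier(times, deltas)
print(rate, log10_cv, curve)
```

On this four\-row example the curve drops at the three event times only; the censored observation at $t = 2$ reduces the risk set without producing a step\.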
### D\.2 Architecture Details
SurvivalPFN uses the same PFN\-style transformer architecture as TabDPT \[[61](https://arxiv.org/html/2605.15488#bib.bib155)\] and CausalPFN \[[2](https://arxiv.org/html/2605.15488#bib.bib93)\], with only task\-specific changes to the token contents and output interpretation\. Each context row $(x_{i}, t_{i}, \delta_{i}) \in \mathcal{D}^{tr}_{\theta}$ is represented as a single token by combining embeddings of the covariates $x_{i}$, observed time $t_{i}$, and event indicator $\delta_{i}$\. Each query row is represented by embeddings of the query covariate $x^{\ast}_{\theta}$ and the query indicator $\widetilde{\delta}^{\ast}$\. We use linear embedding layers and omit positional encodings, so the context is treated as a set rather than an ordered sequence\.
All context and query tokens are passed through a 20\-layer transformer encoder with hidden dimension 384, RMS query\-key normalization, and parallel SwiGLU feed\-forward blocks\. The attention mask follows the standard PFN design \[[36](https://arxiv.org/html/2605.15488#bib.bib91),[37](https://arxiv.org/html/2605.15488#bib.bib92)\]: context tokens attend to one another, while query tokens attend only to the context\. This masking prevents information leakage across queries and allows the model to process many query instances in parallel\. The output representation of each query token is projected to $L = 1024$ logits, and a softmax produces a discretized PPD over the transformed time bins:
$$q_{\omega}(\cdot \mid x^{\ast}_{\theta}, \widetilde{\delta}^{\ast}, \mathcal{D}^{tr}_{\theta}) \ =\ \left[ q_{\omega, \ell}(x^{\ast}_{\theta}, \widetilde{\delta}^{\ast}, \mathcal{D}^{tr}_{\theta}) \right]_{\ell=1}^{L}.$$
Depending on $\widetilde{\delta}^{\ast}$, this distribution is interpreted as an approximation to either the PPD for event time or the PPD for censoring time\. The corresponding PPSD or PPCD is obtained by summing the predicted tail probability mass across bins\.
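The tail\-summation step can be sketched as follows\. This is an illustration under an explicit assumption: the survival value is reported at the $L + 1$ bin edges, where the value at an edge is the total mass of the bins lying beyond it\. How mass is interpolated within a bin is not specified here and is left out\. Function names are hypothetical\.

```python
import math


def ppd_from_logits(logits):
    """Softmax over the L time-bin logits of one query token."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def survival_at_bin_edges(probs):
    """Tail sums: entry k is the mass of bins k, k + 1, ..., L - 1.

    The result has L + 1 entries, starting near 1 at the first edge and
    ending at 0 after the last bin.
    """
    tail = 0.0
    survival = [0.0]
    for p in reversed(probs):
        tail += p
        survival.append(tail)
    survival.reverse()
    return survival


probs = ppd_from_logits([0.0, 0.0, 0.0, 0.0])  # four equal bins for the demo
survival = survival_at_bin_edges(probs)
print(survival)
```

The resulting curve is nonincreasing by construction, because every bin probability is nonnegative\. Mapping the bin edges back to raw time uses the inverse transformation described in Section D\.3\.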
The full model has approximately 20M parameters and is trained in two stages: \(i\) a predictive phase that follows standard predictive PFN training from Ma et al\. \[[61](https://arxiv.org/html/2605.15488#bib.bib155)\], and \(ii\) a survival phase that optimizes the survival likelihood or cross\-entropy loss\. We use AdamW \[[48](https://arxiv.org/html/2605.15488#bib.bib69)\] with warmup and cosine annealing in the predictive phase, and switch to the schedule\-free optimizer \[[13](https://arxiv.org/html/2605.15488#bib.bib156)\] in the survival phase\. The maximum context length is 16K in the first phase and 2,048 in the second\. The predictive phase is trained on four A100 GPUs for up to one week, followed by two and a half days of survival\-phase training on one H100 GPU\.
Training is parallelized over both synthetic context and query tokens\. At each gradient step, we sample $B_{\theta}$ independent DGPs from the prior, generate one context table for each DGP, and draw $B_{q}$ query rows per dataset\. These $B_{\theta} B_{q}$ query predictions are computed in a single batched forward pass, and the final loss is averaged over all DGP\-query pairs\. This batching strategy is identical in spirit to the parallel training procedure used in CausalPFN; see Algorithm 1 in Balazadeh et al\. \[[2](https://arxiv.org/html/2605.15488#bib.bib93)\] for the full procedure\.
### D\.3 Monotone Time Transformations
Survival times are nonnegative and often highly skewed\. Directly discretizing raw time can therefore allocate too many bins to sparse tail regions or make the learned distribution sensitive to dataset\-specific time scales\. To address this, SurvivalPFN applies a context\-fitted monotone transformation before discretization\.
For each dataset $\mathcal{D} = \{(x_{i}, t_{i}, \delta_{i})\}_{i=1}^{N}$, let $g : \mathbb{R}_{+} \to \mathcal{Z}$ denote the dataset\-specific transformation fitted on the observed times of the context tokens $\{t_{i}\}_{i=1}^{N_{c}}$, where $N_{c}$ is the number of context tokens\. The observed context times and the query targets are mapped into model space as
$$z_{i} \ =\ g(t_{i}), \qquad z_{e}^{\ast} \ =\ g(e^{\ast}), \qquad z_{c}^{\ast} \ =\ g(c^{\ast}).$$
The histogram likelihood in Equation [3\.3](https://arxiv.org/html/2605.15488#S3.E3) or the cross\-entropy loss in Equation [D\.2](https://arxiv.org/html/2605.15488#A4.E2) is then evaluated in this transformed space\.
To make predictions, model\-space bin edges are mapped back to raw time through $g^{-1}$, so that the final PPSD/PPCD is reported on the original time scale\. In all cases, $g$ is monotone increasing, preserving the temporal ordering of events and censoring times\.
#### lognormal2normal Transformation.

The `lognormal2normal` transformation uses a simple parametric normalization motivated by the positivity and right-skewness of survival times. For each dataset, we first compute the mean $m$ and variance $s^{2}$ of the observed times of the context tokens. We then fit a lognormal distribution whose raw-time mean and variance match $m$ and $s^{2}$. If $T=\exp(\mu+\sigma Z)$ with $Z\sim\mathcal{N}(0,1)$, then

$$\sigma^{2} = \log\!\left(1+\frac{s^{2}}{m^{2}}\right), \qquad \mu = \log(m)-\frac{1}{2}\sigma^{2}.$$

The forward transformation maps a raw time $t>0$ to its fitted normal coordinate:

$$g_{\mathrm{LN}}(t) = \frac{\log t-\mu}{\sigma}.$$

The inverse transformation is

$$g_{\mathrm{LN}}^{-1}(z) = \exp(\mu+\sigma z).$$

Thus, fixed-width bins in model space correspond to adaptive, nonuniform bins in raw time. This transformation is smooth, bijective on $\mathbb{R}_{+}$, preserves time ordering, and can extrapolate beyond the largest observed time. Its main limitation is that it imposes a lognormal shape on the observed-time distribution; this parametric normalization may not allocate resolution optimally when the parametric form is incorrect.
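The moment-matching fit and the two maps above can be sketched in a few lines. This is an illustrative sketch rather than the released implementation: the function names are hypothetical, the population form of the variance is an assumption, and the example times are made up. It also assumes the context times are not all identical, since $\sigma=0$ would make the forward map undefined.

```python
import math

def fit_lognormal2normal(times):
    """Moment-match a lognormal to the context times; return (mu, sigma)."""
    n = len(times)
    m = sum(times) / n                          # raw-time mean
    s2 = sum((t - m) ** 2 for t in times) / n   # raw-time variance (population form, assumed)
    sigma2 = math.log(1.0 + s2 / m ** 2)
    mu = math.log(m) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)

def g_ln(t, mu, sigma):
    """Forward map: raw time t > 0 -> fitted normal coordinate z."""
    return (math.log(t) - mu) / sigma

def g_ln_inv(z, mu, sigma):
    """Inverse map: normal coordinate z -> raw time."""
    return math.exp(mu + sigma * z)

# Hypothetical context times, for illustration only.
context_times = [2.0, 5.0, 9.0, 14.0, 40.0]
mu, sigma = fit_lognormal2normal(context_times)

# Fixed-width edges in model space become nonuniform edges in raw time.
z_edges = [-2.0, -1.0, 0.0, 1.0, 2.0]
raw_edges = [g_ln_inv(z, mu, sigma) for z in z_edges]
```

By construction, the fitted lognormal has mean $\exp(\mu+\sigma^{2}/2)=m$, which gives a quick sanity check on the fit.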
#### time2quantile Transformation.

The `time2quantile` transformation is a nonparametric, context-adaptive alternative. It maps raw time into an empirical quantile coordinate in $[0,1]$. For all context tokens in the dataset, we sort the observed times and collapse duplicates to obtain unique knots

$$0 = a_{0} < a_{1} < \cdots < a_{K} = \max_{i}t_{i}.$$

Each knot $a_{j}$ is assigned its right-continuous empirical CDF value

$$q_{j} = \widehat{F}(a_{j}) = \frac{1}{N_{c}}\sum_{i=1}^{N_{c}}\mathbb{1}\{t_{i}\leq a_{j}\}, \qquad q_{0} = 0, \quad q_{K} = 1.$$

The forward transformation is the piecewise-linear interpolation of these knots:

$$g_{\mathrm{Q}}(t) = q_{j}+\frac{t-a_{j}}{a_{j+1}-a_{j}}(q_{j+1}-q_{j}), \qquad a_{j} \leq t \leq a_{j+1}.$$

Values above the largest observed time are mapped to $1$. The inverse transformation swaps the axes and linearly interpolates from quantile space back to raw time:

$$g_{\mathrm{Q}}^{-1}(q) = a_{j}+\frac{q-q_{j}}{q_{j+1}-q_{j}}(a_{j+1}-a_{j}), \qquad q_{j} \leq q \leq q_{j+1}.$$

For this transformation, the model-space range is fixed to $[0,1]$. Uniform bins in quantile space become adaptive raw-time bins: regions with many observed times receive finer resolution, while sparse regions receive wider bins.

The `time2quantile` transformation makes no parametric assumption on the time distribution and is robust to changes in time units or monotone rescalings of raw time. However, because it is fitted from the empirical distribution of $\{t_{i}\}_{i=1}^{N_{c}}$, it is context-local and cannot resolve the tail shape beyond $\max_{i}t_{i}$; all larger times are mapped to quantile $1$, with probability mass beyond this point represented only by the final residual histogram bin.
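The knot construction and both interpolation directions can be illustrated as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, observed times are assumed strictly positive, and values outside the knot range are clamped, which reproduces the "mapped to $1$" behavior above the largest time.

```python
from bisect import bisect_right

def fit_time2quantile(times):
    """Return knots a_0..a_K and right-continuous ECDF values q_0..q_K."""
    n = len(times)
    sorted_times = sorted(times)
    knots = [0.0] + sorted(set(t for t in times if t > 0))
    # q_j = fraction of context times <= a_j; q_0 is pinned to 0 as in the text.
    quantiles = [0.0] + [bisect_right(sorted_times, a) / n for a in knots[1:]]
    return knots, quantiles

def interp(x, xs, ys):
    """Piecewise-linear interpolation, clamped to the end values outside xs."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    j = bisect_right(xs, x) - 1
    w = (x - xs[j]) / (xs[j + 1] - xs[j])
    return ys[j] + w * (ys[j + 1] - ys[j])

def g_q(t, knots, quantiles):
    """Forward map: raw time -> quantile coordinate in [0, 1]."""
    return interp(t, knots, quantiles)

def g_q_inv(q, knots, quantiles):
    """Inverse map: swap the axes and interpolate back to raw time."""
    return interp(q, quantiles, knots)

# Hypothetical context times with one duplicate.
knots, quantiles = fit_time2quantile([1.0, 2.0, 2.0, 4.0])
```

Here the duplicate at $t=2$ makes the ECDF jump from $0.25$ to $0.75$, so the interval $[1,2]$ receives half of the quantile range and therefore finer raw-time resolution.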
#### Summary\.
The two transformations reflect complementary design choices. The `lognormal2normal` transformation provides a smooth positive-time coordinate with parametric tail extrapolation, while `time2quantile` provides a fully context-adaptive coordinate that normalizes all tasks to the common interval $[0,1]$. Both transformations preserve time ordering and allow SurvivalPFN to allocate discretization resolution more effectively than raw-time binning.
### D.4 Training Objective

SurvivalPFN is trained to predict a discretized PPD over transformed time. For a query with indicator $\widetilde{\delta}^{\ast}$, define the latent supervised target

$$r^{\ast}_{\theta}(\widetilde{\delta}^{\ast}) = \widetilde{\delta}^{\ast}e^{\ast}_{\theta}+\bigl(1-\widetilde{\delta}^{\ast}\bigr)c^{\ast}_{\theta}.$$

Thus, $\widetilde{\delta}^{\ast}=1$ asks for the latent event time, whose posterior predictive tail gives the PPSD, while $\widetilde{\delta}^{\ast}=0$ asks for the latent censoring time, corresponding to the PPCD.

Let $g_{\mathcal{D}^{tr}_{\theta}}:\mathbb{R}_{+}\to\mathcal{Z}$ be the monotone transformation fitted from the observed context times in $\mathcal{D}^{tr}_{\theta}$, and let $\{\mathcal{I}_{\ell}\}_{\ell=1}^{L}$ denote the ordered bins in transformed-time space. Define the bin-index map

$$\kappa_{\mathcal{D}^{tr}_{\theta}}(r) = \ell \quad\text{if}\quad g_{\mathcal{D}^{tr}_{\theta}}(r)\in\mathcal{I}_{\ell},$$

with boundary clipping handled by the same convention used in the implementation. Equivalently, define the one-hot target

$$b_{\ell,\mathcal{D}^{tr}_{\theta}}(r) = \mathbb{I}\!\left\{\kappa_{\mathcal{D}^{tr}_{\theta}}(r)=\ell\right\}, \qquad \ell=1,\ldots,L.$$
#### Likelihood loss\.
The main SurvivalPFN checkpoint is trained with the one-hot discrete negative log-likelihood over transformed-time bins:

$$\mathrm{NLL}\!\left(r^{\ast}_{\theta}\,\middle\|\,q_{\omega}\right) = -\log q_{\omega,\kappa_{\mathcal{D}^{tr}_{\theta}}(r^{\ast}_{\theta})}\!\left(x^{\ast}_{\theta},\widetilde{\delta}^{\ast},\mathcal{D}^{tr}_{\theta}\right).$$

Equivalently,

$$\mathrm{NLL}\!\left(r^{\ast}_{\theta}\,\middle\|\,q_{\omega}\right) = -\sum_{\ell=1}^{L}b_{\ell,\mathcal{D}^{tr}_{\theta}}(r^{\ast}_{\theta})\log q_{\omega,\ell}\!\left(x^{\ast}_{\theta},\widetilde{\delta}^{\ast},\mathcal{D}^{tr}_{\theta}\right).$$

The corresponding population objective is

$$\mathcal{L}_{\mathrm{NLL}}(\omega) = \mathbb{E}_{\theta\sim\pi(\cdot)}\,\mathbb{E}_{\mathcal{D}^{tr}_{\theta},\,x^{\ast}_{\theta},\,e^{\ast}_{\theta},\,c^{\ast}_{\theta},\,\widetilde{\delta}^{\ast}}\left[-\log q_{\omega,\kappa_{\mathcal{D}^{tr}_{\theta}}\!\left(r^{\ast}_{\theta}(\widetilde{\delta}^{\ast})\right)}\!\left(x^{\ast}_{\theta},\widetilde{\delta}^{\ast},\mathcal{D}^{tr}_{\theta}\right)\right]. \tag{D.1}$$

This is the objective used by the best validation checkpoint reported in the main results.
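For a single query, the target selection, bin-index map, and discrete NLL can be sketched as below. The names are hypothetical, bins are indexed from 0 rather than 1, and clipping out-of-range targets to the first or last bin is an assumption, since the text defers the boundary convention to the implementation.

```python
import math
from bisect import bisect_right

def latent_target(delta_query, event_time, censor_time):
    """r = delta * e + (1 - delta) * c: pick the latent time the query asks for."""
    return event_time if delta_query == 1 else censor_time

def bin_index(z, edges):
    """0-based index of the bin [edges[l], edges[l+1]) containing z, clipped to range."""
    n_bins = len(edges) - 1
    idx = bisect_right(edges, z) - 1
    return min(max(idx, 0), n_bins - 1)   # assumed boundary convention

def discrete_nll(probs, z_target, edges):
    """Negative log-probability of the bin that holds the transformed target."""
    return -math.log(probs[bin_index(z_target, edges)])

# Hypothetical model output over four bins in transformed-time space.
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
probs = [0.1, 0.2, 0.6, 0.1]

# The query asks for the event time (delta = 1); 0.6 is its transformed value.
z_target = latent_target(1, 0.6, 0.9)
loss = discrete_nll(probs, z_target, edges)
```

Because the simulator supplies both latent times, the same loss applies whether the query asks for the event time or the censoring time; only the selected target changes.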
#### Smoothed cross\-entropy variant\.
We also consider a smoothed cross-entropy loss, which is used in TabDPT \[[61](https://arxiv.org/html/2605.15488#bib.bib155)\] and CausalPFN \[[2](https://arxiv.org/html/2605.15488#bib.bib93)\]. Instead of assigning all mass to one bin, the target time is converted into a narrow histogram in transformed-time space:

$$a_{\ell,\mathcal{D}^{tr}_{\theta}}^{(\sigma)}(r) = \int_{\mathcal{I}_{\ell}}\alpha_{\sigma}\!\left(z;\,g_{\mathcal{D}^{tr}_{\theta}}(r)\right)dz, \qquad \ell=1,\ldots,L,$$

where $\alpha_{\sigma}$ is a narrow smoothing density, implemented as a Gaussian centered at $g_{\mathcal{D}^{tr}_{\theta}}(r)$ with a predefined variance. The smoothed cross-entropy loss is

$$\mathrm{SCE}_{\sigma}\!\left(r^{\ast}_{\theta}\,\middle\|\,q_{\omega}\right) = -\sum_{\ell=1}^{L}a_{\ell,\mathcal{D}^{tr}_{\theta}}^{(\sigma)}(r^{\ast}_{\theta})\log q_{\omega,\ell}\!\left(x^{\ast}_{\theta},\widetilde{\delta}^{\ast},\mathcal{D}^{tr}_{\theta}\right). \tag{D.2}$$

As $\sigma\to 0$, the smoothed target $a_{\ell,\mathcal{D}^{tr}_{\theta}}^{(\sigma)}(r)$ reduces to the one-hot target $b_{\ell,\mathcal{D}^{tr}_{\theta}}(r)$, and the smoothed cross-entropy recovers the discrete NLL. In our experiments, this loss is treated as an alternative training option and evaluated in the ablation study.
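The smoothed target is a difference of Gaussian CDF values at the bin edges, which the following sketch makes concrete. The function names are hypothetical, and renormalizing the mass over the finite bin range is an assumption that the text does not specify; it also presumes the target lies close enough to the bins that the total mass is nonzero.

```python
import math

def normal_cdf(x, mean, std):
    """CDF of a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def smoothed_target(z, edges, std):
    """Gaussian mass falling in each bin, renormalized over the bin range (assumed)."""
    mass = [normal_cdf(hi, z, std) - normal_cdf(lo, z, std)
            for lo, hi in zip(edges[:-1], edges[1:])]
    total = sum(mass)
    return [m / total for m in mass]

def smoothed_ce(probs, z, edges, std):
    """Cross-entropy between the smoothed target and the predicted histogram."""
    target = smoothed_target(z, edges, std)
    return -sum(a * math.log(q) for a, q in zip(target, probs) if a > 0.0)

# Hypothetical bins and model output in transformed-time space.
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
probs = [0.1, 0.2, 0.6, 0.1]

soft = smoothed_target(0.6, edges, 0.1)          # mass leaks into neighboring bins
sharp_loss = smoothed_ce(probs, 0.6, edges, 1e-4)  # near the one-hot limit
```

With a very small standard deviation, all mass falls in the bin containing the target and the loss matches the discrete NLL, consistent with the $\sigma\to 0$ limit stated above.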
#### Note\.
This prior-data likelihood differs from the standard right-censored survival likelihood in Section [2](https://arxiv.org/html/2605.15488#S2). The standard observed-data likelihood trains on partially observed query outcomes $(t^{\ast},\delta^{\ast})$ and must account for censored queries through survival-tail probabilities. In contrast, SurvivalPFN training uses the simulator-provided latent query times $e^{\ast}_{\theta}$ or $c^{\ast}_{\theta}$ as supervision.
### D.5 Inference Procedure

At inference time, SurvivalPFN takes an observed survival dataset $\mathcal{D}$ as context and one or more query covariates $x^{\ast}$. For survival prediction, we set the query indicator to $\widetilde{\delta}^{\ast}=1$, so that the model outputs a discretized posterior predictive event-time distribution

$$q_{\omega}\!\left(\cdot\,\middle|\,x^{\ast},\widetilde{\delta}^{\ast}=1,\mathcal{D}\right) = \left[q_{\omega,\ell}(x^{\ast},\widetilde{\delta}^{\ast}=1,\mathcal{D})\right]_{\ell=1}^{L}.$$

This requires only a single forward pass and does not involve dataset-specific gradient updates, posterior sampling, or hyperparameter tuning. Let $g_{\mathcal{D}}$ denote the monotone time transformation fitted from the observed times in $\mathcal{D}$, and let $\{\mathcal{I}_{\ell}\}_{\ell=1}^{L}$ denote the discretized bins in the transformed time space. The predicted posterior predictive survival distribution (PPSD) is obtained by summing the probability mass assigned to bins whose raw-time support lies above $t$:

$$\widehat{S}_{\omega}(t\mid x^{\ast},\mathcal{D}) = \sum_{\ell:\,g_{\mathcal{D}}^{-1}(\mathcal{I}_{\ell})>t}q_{\omega,\ell}\!\left(x^{\ast},\widetilde{\delta}^{\ast}=1,\mathcal{D}\right),$$

which is implemented as a cumulative tail sum over the discretized time bins. Similarly, setting $\widetilde{\delta}^{\ast}=0$ yields the PPCD, although the PPCD is generally not of interest at inference time.
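The tail sum can be illustrated by evaluating the survival curve at the raw-time bin edges. This is a sketch, not the released code: `survival_curve` is a hypothetical helper, and the inverse map passed in (a lognormal-style inverse with made-up parameters) only stands in for the fitted $g_{\mathcal{D}}^{-1}$.

```python
import math
from itertools import accumulate

def survival_curve(probs, z_edges, g_inv):
    """Return raw-time bin edges and the survival probability at each edge."""
    raw_edges = [g_inv(z) for z in z_edges]
    # Tail sums: surv[l] is the total mass of bins l, l+1, ..., L-1 (0-based),
    # which are exactly the bins lying above raw_edges[l].
    tails = list(accumulate(reversed(probs)))
    surv = list(reversed(tails)) + [0.0]
    return raw_edges, surv

# Hypothetical histogram over four model-space bins.
probs = [0.1, 0.2, 0.6, 0.1]
z_edges = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Stand-in inverse transformation with assumed parameters mu = 1.0, sigma = 0.5.
raw_edges, surv = survival_curve(probs, z_edges, lambda z: math.exp(1.0 + 0.5 * z))
```

The curve starts at the total mass (one for a normalized histogram) at the first edge and reaches zero at the last edge; values between edges would need an interpolation rule, which is not specified here.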
## Appendix E Benchmarking Details

### E.1 Baseline Model Details

In this study, we compare SurvivalPFN against a broad range of 20 representative survival analysis methods spanning classical statistical models, tree-based ensembles, discrete-time neural survival models, continuous-time neural survival models, and methods using TFMs. Below, we briefly summarize each baseline, including its core modeling idea and the implementation details used in our benchmark. A side-by-side overview of their methodological properties is provided in Table [2](https://arxiv.org/html/2605.15488#A5.T2).

For each column in the table, *continuous-time* indicates whether the method explicitly parameterizes a smooth survival distribution over continuous time, without relying on post-hoc interpolation over a discrete time grid. *(Semi-)parametric* indicates whether the method specifies the hazard, density, or survival function through an explicit parametric or semi-parametric form. The *proportional hazards (PH)* assumption indicates that covariates act multiplicatively on a shared baseline hazard, so that hazard ratios between individuals are constant over time. Finally, the *ensemble/mixture* column indicates whether the method combines a finite collection of base learners, components, or experts, such as trees in an ensemble or distributions in a finite mixture model.
In particular, SurvivalMDN is marked parametric because it uses a finite mixture of Gaussians, and DSM is marked parametric because it uses a finite mixture of Weibulls; only an infinite\-mixture construction would be nonparametric\.
Table 2: Qualitative comparison of survival models considered in our benchmark. Columns indicate whether a method explicitly parameterizes a continuous-time (cont. time) survival distribution, imposes a (semi-)parametric form, assumes proportional hazards (PH), uses an ensemble or finite-mixture construction, or belongs to the tabular foundation model (TFM) family.

| Model | Cont. time | (Semi-)parametric | PH assump. | Ensemble/mixture | TFM | Distinguishing feature |
| --- | --- | --- | --- | --- | --- | --- |
| **Prior-fitted and tabular foundation models** | | | | | | |
| SurvivalPFN | ✗ | ✗ | ✗ | ✗ | ✓ | Novel in-context Bayesian survival estimator. |
| StaticSurvivalTFM | ✗ | ✗ | ✗ | ✗ | ✓ | Handling right-censoring with tabular foundation models. |
| **Classical survival models** | | | | | | |
| CoxPH | ✗ | ✓ | ✓ | ✗ | ✗ | Semi-parametric Cox proportional hazards model. |
| CoxNet | ✗ | ✓ | ✓ | ✗ | ✗ | Regularized Cox proportional hazards model. |
| cSVR | ✗ | ✗ | ✗ | ✗ | ✗ | Censored support-vector regression baseline. |
| **Tree-based models** | | | | | | |
| GB | ✗ | ✓ | ✓ | ✓ | ✗ | Gradient-boosted Cox proportional hazards model. |
| CWGB | ✗ | ✓ | ✓ | ✓ | ✗ | Component-wise gradient-boosted Cox model. |
| RSF | ✗ | ✗ | ✗ | ✓ | ✗ | Random survival forest for nonlinear effects and interactions. |
| **Neural discrete-time models** | | | | | | |
| DeepHit | ✗ | ✗ | ✗ | ✗ | ✗ | Neural discrete-time model emphasizing discrimination. |
| DeepSurv | ✗ | ✓ | ✓ | ✗ | ✗ | Neural Cox proportional hazards model. |
| MTLR | ✗ | ✗ | ✗ | ✗ | ✗ | Multi-task logistic regression for discrete-time survival. |
| Nnet-survival | ✗ | ✗ | ✗ | ✗ | ✗ | Discrete-time neural survival model. |
| CoxTime | ✗ | ✓ | ✗ | ✗ | ✗ | Neural Cox model with time-varying effects. |
| IWSG | ✗ | ✗ | ✗ | ✗ | ✗ | Explicitly models the censoring mechanism. |
| CQRNN | ✗ | ✗ | ✗ | ✗ | ✗ | Quantile-regression survival baseline. |
| BNN-MTLR | ✗ | ✗ | ✗ | ✗ | ✗ | Bayesian neural extension of MTLR. |
| **Neural continuous-time models** | | | | | | |
| DSM | ✓ | ✓ | ✗ | ✓ | ✗ | Finite parametric (Weibull) mixture components. |
| SumoNet | ✓ | ✗ | ✗ | ✗ | ✗ | Continuous survival function via automatic differentiation. |
| SurvivalMDN | ✓ | ✓ | ✗ | ✓ | ✗ | Finite parametric (Gaussian) mixture components. |
| DeepAFT-Weibull | ✓ | ✓ | ✓ | ✗ | ✗ | Continuous Weibull accelerated failure time model. |
| DeepAFT-Log-logistic | ✓ | ✓ | ✗ | ✗ | ✗ | Continuous log-logistic accelerated failure time model. |

1. SurvivalPFN. SurvivalPFN is the model developed in this paper; see the description in Section [3](https://arxiv.org/html/2605.15488#S3) and Appendix [D](https://arxiv.org/html/2605.15488#A4).
2. StaticSurvivalTFM \[[46](https://arxiv.org/html/2605.15488#bib.bib130)\]. StaticSurvivalTFM is a framework that converts any binary classifier into a survival predictor; we provide methodological details in Appendix [H](https://arxiv.org/html/2605.15488#A8). For a fair comparison with SurvivalPFN, we use TabDPT \[[61](https://arxiv.org/html/2605.15488#bib.bib155)\] as the backbone binary classifier in the main benchmark. We further study the effect of replacing this backbone with MITRA \[[106](https://arxiv.org/html/2605.15488#bib.bib159)\] (the best version reported in Kim et al. \[[46](https://arxiv.org/html/2605.15488#bib.bib130)\]), with results reported in Appendix [G.4](https://arxiv.org/html/2605.15488#A7.SS4).
3. Cox proportional hazards model (CoxPH) \[[11](https://arxiv.org/html/2605.15488#bib.bib38)\]. CoxPH is the classical semi-parametric proportional-hazards model, estimating covariate effects through the Cox partial likelihood while leaving the baseline hazard unspecified. We used the `scikit-survival` implementation \[[78](https://arxiv.org/html/2605.15488#bib.bib54)\], which produces risk scores and can recover survival curves using an estimated baseline hazard \[[5](https://arxiv.org/html/2605.15488#bib.bib122)\].
4. Elastic-net Cox proportional hazards model (CoxNet) \[[94](https://arxiv.org/html/2605.15488#bib.bib121)\]. CoxNet regularizes the Cox partial likelihood with an elastic-net penalty, making Cox-style modeling more stable in high-dimensional settings. We used the `scikit-survival` implementation \[[78](https://arxiv.org/html/2605.15488#bib.bib54)\], with the same baseline hazard estimator as CoxPH.
5. Censored Support Vector Regression (cSVR) \[[77](https://arxiv.org/html/2605.15488#bib.bib120)\]. cSVR extends support-vector regression to right-censored outcomes by treating censored observations through inequality constraints or ranking-style losses. Specifically, we use the `FastSurvivalSVM` method in `scikit-survival` \[[78](https://arxiv.org/html/2605.15488#bib.bib54)\]. Because cSVR only outputs a scalar regression time rather than a full survival distribution, it cannot produce valid values for evaluation metrics that require an entire survival curve or event-time distribution.
6. Gradient-boosted Cox model (GB) \[[86](https://arxiv.org/html/2605.15488#bib.bib124)\]. GB fits an additive risk model by gradient boosting under a Cox partial-likelihood objective. We used the `scikit-survival` implementation \[[78](https://arxiv.org/html/2605.15488#bib.bib54)\], which yields a boosted risk score and survival estimates through the fitted baseline hazard, as in CoxPH.
7. Component-wise gradient-boosted Cox model (CWGB) \[[40](https://arxiv.org/html/2605.15488#bib.bib123)\]. CWGB is a component-wise variant of gradient-boosted Cox regression, where weak learners update individual covariate components under a Cox-style objective. We used the `scikit-survival` implementation \[[78](https://arxiv.org/html/2605.15488#bib.bib54)\].
8. Random Survival Forests (RSF) \[[41](https://arxiv.org/html/2605.15488#bib.bib49)\]. RSF is an ensemble of survival trees that partitions the feature space using survival-specific split criteria and averages terminal-node survival estimates across trees. We used the `scikit-survival` implementation \[[78](https://arxiv.org/html/2605.15488#bib.bib54)\].
9. DeepHit \[[55](https://arxiv.org/html/2605.15488#bib.bib30)\]. DeepHit is a discrete-time neural survival model that directly parameterizes the probability mass function over event-time bins. We used the `pycox` implementation; its objective combines a likelihood term with a C-index-like ranking term, explicitly encouraging discrimination.
10. DeepSurv \[[45](https://arxiv.org/html/2605.15488#bib.bib29)\]. DeepSurv replaces the linear predictor in CoxPH with a neural network while retaining the Cox partial-likelihood objective and proportional-hazards structure. We reimplemented it following the public PyTorch implementation ([https://github.com/czifan/DeepSurv.pytorch](https://github.com/czifan/DeepSurv.pytorch)) and added baseline hazard estimation as in CoxPH.
11. Multi-task Logistic Regression (MTLR) \[[103](https://arxiv.org/html/2605.15488#bib.bib82),[19](https://arxiv.org/html/2605.15488#bib.bib83)\]. MTLR parameterizes the survival distribution over a discrete time grid using a sequence of dependent logistic regressors. We reimplemented MTLR following the public PyTorch implementation ([https://github.com/mkazmier/torchmtlr](https://github.com/mkazmier/torchmtlr)).
12. Nnet-survival \[[4](https://arxiv.org/html/2605.15488#bib.bib51),[22](https://arxiv.org/html/2605.15488#bib.bib52)\]. Nnet-survival models discrete-time conditional hazards with a neural network and trains with a binary cross-entropy-style survival likelihood over time intervals. We used the `pycox` implementation, converting predicted discrete hazards into survival curves for distributional evaluation.
13. CoxTime \[[53](https://arxiv.org/html/2605.15488#bib.bib39)\]. CoxTime generalizes neural Cox regression by allowing the relative risk to depend on both covariates and time, thereby relaxing the proportional-hazards assumption. We used the `pycox` implementation and recovered survival curves from the learned time-dependent risk function and the same baseline hazard estimator as CoxPH.
14. Inverse-Weighted Survival Games (IWSG) \[[31](https://arxiv.org/html/2605.15488#bib.bib126)\]. IWSG explicitly models both the failure-time and censoring distributions through an inverse-probability-of-censoring-weighted (IPCW) game objective. We reimplemented it from the official IWSG codebase ([https://github.com/rajesh-lab/Inverse-Weighted-Survival-Games](https://github.com/rajesh-lab/Inverse-Weighted-Survival-Games)).
15. Censored Quantile Regression Neural Network (CQRNN) \[[73](https://arxiv.org/html/2605.15488#bib.bib127)\]. CQRNN directly predicts event-time quantiles under censoring, providing a distribution-free way to represent time-to-event uncertainty. We reimplemented it following the official CQRNN codebase ([https://github.com/TeaPearce/Censored\_Quantile\_Regression\_NN](https://github.com/TeaPearce/Censored_Quantile_Regression_NN)) and converted the predicted quantile function into a monotone survival curve when distributional metrics (IBS and D-calibration) were required.
16. Bayesian Neural Network Multi-task Logistic Regression (BNN-MTLR) \[[80](https://arxiv.org/html/2605.15488#bib.bib98)\]. BNN-MTLR extends MTLR with Bayesian neural-network uncertainty, producing a PPD over discrete survival curves. We reimplemented it from the official BNN-MTLR codebase ([https://github.com/shi-ang/BNN-ISD](https://github.com/shi-ang/BNN-ISD)).
17. Deep Survival Machines (DSM) \[[68](https://arxiv.org/html/2605.15488#bib.bib32)\]. DSM parameterizes the event-time distribution as a finite mixture of Weibull components, with neural networks producing mixture weights and component parameters. We reimplemented DSM using the code in the `auton-survival` package \[[69](https://arxiv.org/html/2605.15488#bib.bib128)\].
18. Survival Monotonic Network (SuMoNet) \[[87](https://arxiv.org/html/2605.15488#bib.bib55)\]. SuMoNet models continuous-time survival distributions with monotonic neural networks, using automatic differentiation to obtain valid densities from the learned cumulative distribution function. We reimplemented it following the official SuMoNet codebase ([https://github.com/MrHuff/Sumo-Net](https://github.com/MrHuff/Sumo-Net)).
19. Survival Mixture Density Network (SurvivalMDN) \[[32](https://arxiv.org/html/2605.15488#bib.bib125)\]. SurvivalMDN models the event-time distribution using a finite mixture density network, providing a flexible but still finite-dimensional parametric mixture representation. We reimplemented it from the official SurvivalMDN codebase ([https://github.com/XintianHan/Survival-MDN](https://github.com/XintianHan/Survival-MDN)).
20. Neural Weibull accelerated failure-time model (DeepAFT-Weibull) \[[71](https://arxiv.org/html/2605.15488#bib.bib129)\]. DeepAFT-Weibull uses a neural network to parameterize a Weibull accelerated failure-time distribution for right-censored data. We reimplemented the model following the paper.
21. Neural log-logistic accelerated failure-time model (DeepAFT-Loglogistic) \[[71](https://arxiv.org/html/2605.15488#bib.bib129)\]. DeepAFT-Loglogistic similarly uses a neural network to parameterize a log-logistic accelerated failure-time distribution. We reimplemented the model following the paper.
### E.2 Unified Time Grid for Consistent Evaluation

All methods in our benchmark are evaluated through a common prediction object. After fitting, each model is converted into a survival-probability matrix

$$\widehat{\mathbf{S}} = \left[\widehat{S}_{i}(t_{j})\right]_{i=1,\ldots,n_{\mathrm{test}};\,j=1,\ldots,m}, \qquad \mathcal{G} = \{t_{1}<\cdots<t_{m}\},$$

where $\widehat{S}_{i}(t_{j})$ denotes the predicted survival probability for test individual $i$ at time $t_{j}$. All metrics are then computed from $(\mathcal{G},\widehat{\mathbf{S}})$. Thus, differences across methods enter only through how their native time representation is constructed before prediction.
To avoid evaluation leakage, all time grids are defined using only the data available during training; test\-set event times are never used to define the evaluation support\.
For models that require predefined bins to predict a discrete-time survival function, including MTLR, DeepHit, Nnet-survival, BNN-MTLR, and StaticSurvivalTFM, we use an event-time quantile grid. Let

$$\mathcal{T}_{E} = \{T_{i}:\delta_{i}=1,\ i\in\mathcal{D}^{tr}\}$$

denote the uncensored event times in the fitting data. When the number of bins is not specified, we set

$$K = \left\lceil\sqrt{\lvert\mathcal{T}_{E}\rvert}\right\rceil,$$

and define the discrete support by the unique empirical quantiles

$$\mathcal{G}_{\mathrm{disc}} = \operatorname{unique}\left\{Q_{\mathcal{T}_{E}}\!\left(\frac{k-1}{K-1}\right):\ k=1,\ldots,K\right\},$$

where $Q_{\mathcal{T}_{E}}$ is the empirical quantile function. This grid provides the native discretization on which discrete-time models are trained, with each bin containing approximately the same number of uncensored instances.
For DeepHit and Nnet-survival, the implementation requires the first bin location to be smaller than the smallest time in the data. Therefore, we additionally shift the first bin location:

$$b_{1} \leftarrow \max\left\{\min_{i\in\mathcal{D}^{tr}}\{T_{i}-\epsilon\},\ 0\right\}, \qquad \epsilon=10^{-5}.$$
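The event-time quantile grid and the first-edge shift can be sketched as follows. The helper names are hypothetical, and the linear-interpolation quantile convention is an assumption, since the text does not state which empirical quantile definition is used.

```python
import math

def empirical_quantile(sorted_vals, p):
    """Linear-interpolation quantile of sorted values (an assumed convention)."""
    pos = p * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def event_quantile_grid(times, events, n_bins=None):
    """Unique quantiles of the uncensored times at levels (k-1)/(K-1), k = 1..K."""
    event_times = sorted(t for t, d in zip(times, events) if d == 1)
    k_bins = n_bins if n_bins is not None else math.ceil(math.sqrt(len(event_times)))
    k_bins = max(k_bins, 2)   # guard: the quantile levels need at least two points
    levels = [(k - 1) / (k_bins - 1) for k in range(1, k_bins + 1)]
    return sorted({empirical_quantile(event_times, p) for p in levels})

def shift_first_edge(grid, times, eps=1e-5):
    """Move the first edge just below the smallest training time, floored at 0."""
    return [max(min(times) - eps, 0.0)] + grid[1:]

# Hypothetical training data: ten subjects, one censored at t = 5.
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
events = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]

grid = event_quantile_grid(times, events)   # 9 events, so K = 3
shifted = shift_first_edge(grid, times)
```

Note that the grid is built from uncensored times only, while the shift uses the minimum over all training times, matching the two formulas above.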
Continuous-time models do not require a training-time discretization. Nevertheless, curve-based evaluation still requires a finite query set. For these methods, we define the evaluation support from the observed fitting durations:

$$\mathcal{G}_{\mathrm{cont}} = \{0\}\cup\operatorname{unique}\{T_{i}:\ i\in\mathcal{D}^{tr}\}, \tag{E.1}$$

with duplicate zeros removed if necessary. For quantile-output models such as CQRNN, predicted quantile functions are first converted to survival probabilities on the same support before metric computation.
This protocol preserves each model class’s natural time parameterization: discrete\-time methods learn on event\-time quantile bins, whereas continuous\-time methods are queried on the empirical training\-time support\.
### E.3 Hyperparameter Tuning

We tune hyperparameters only for neural-network baselines whose performance depends on optimizer and architecture choices. The classical `scikit-survival` models (CoxPH, CoxNet, cSVR, GB, CWGB, and RSF), StaticSurvivalTFM, and SurvivalPFN are kept fixed at their benchmark settings.
For each outer split $r$, hyperparameter selection is performed using only the training-side data,

$$\mathcal{D}^{tr+val}_{(r)} = \mathcal{D}^{tr}_{(r)}\cup\mathcal{D}^{val}_{(r)},$$

and the test set is never used for either tuning or discretization. For each model $m$, we sample $R=10$ configurations from its search space $\Lambda_{m}$ and estimate their performance by shuffled $F=5$-fold cross-validation on $\mathcal{D}^{tr+val}_{(r)}$, using the outer split seed for reproducibility. The selected configuration is

$$\lambda_{m,(r)}^{\ast} = \arg\min_{\lambda\in\{\lambda_{1},\ldots,\lambda_{R}\}}\frac{1}{F}\sum_{f=1}^{F}\mathcal{L}^{val}_{(f)}(\lambda;m),$$

where $\mathcal{L}^{val}_{(f)}$ denotes the model-specific objective loss on the validation portion of fold $f$. Configurations that fail to complete training are treated as infeasible and assigned infinite validation loss. After selection, model $m$ is refit on $\mathcal{D}^{tr+val}_{(r)}$ using $\lambda_{m,(r)}^{\ast}$ and evaluated once on the held-out test split.
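The selection rule can be sketched as a random search scored by shuffled cross-validation. Everything here is illustrative: `select_config` and `toy_fit_and_score` are hypothetical names, the two-parameter search space is a reduced stand-in for Table 3, and the toy loss replaces real model training.

```python
import math
import random

def select_config(configs, data, fit_and_score, n_folds=5, seed=0):
    """Return the config with the lowest mean validation loss over shuffled folds."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)               # seeded shuffle for reproducibility
    folds = [idx[f::n_folds] for f in range(n_folds)]
    best_cfg, best_loss = None, math.inf
    for cfg in configs:
        losses = []
        for fold in folds:
            held_out = set(fold)
            train = [data[i] for i in idx if i not in held_out]
            val = [data[i] for i in fold]
            try:
                losses.append(fit_and_score(cfg, train, val))
            except Exception:
                losses.append(math.inf)            # failed runs count as infeasible
        mean_loss = sum(losses) / n_folds
        if mean_loss < best_loss:
            best_cfg, best_loss = cfg, mean_loss
    return best_cfg, best_loss

def toy_fit_and_score(cfg, train, val):
    """Stand-in for training plus validation; fails for one dropout value."""
    if cfg["dropout"] == 0.6:
        raise RuntimeError("simulated training failure")
    return (math.log10(cfg["lr"]) + 3.0) ** 2 + cfg["dropout"]

# Reduced, hypothetical search space; R = 10 sampled configurations.
search_space = {"lr": [1e-4, 1e-3, 1e-2], "dropout": [0.0, 0.4, 0.6]}
rng = random.Random(0)
configs = [{k: rng.choice(v) for k, v in search_space.items()} for _ in range(10)]

data = list(range(50))
best_cfg, best_loss = select_config(configs, data, toy_fit_and_score, n_folds=5, seed=0)
```

In the protocol above, the selected configuration would then be refit on the full training-side data before the single evaluation on the test split; that step is omitted here.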
The full model-specific search spaces are summarized in Table [3](https://arxiv.org/html/2605.15488#A5.T3). All remaining optimization settings are fixed across tuning trials: AdamW optimizer, batch size $256$, ReLU activations, no normalization layer, early stopping, and a maximum budget of $10{,}000$ epochs. For models that require an explicit learning-rate floor, we set

$$\eta_{\min} = 10^{-3}\eta,$$

where $\eta$ is the tuned learning rate. Model-specific constants not included in Table [3](https://arxiv.org/html/2605.15488#A5.T3) are fixed throughout tuning.
Table 3: Hyperparameter search spaces for tuned neural baselines. The table defines the shared base space and model-specific extensions.

| Models | Hyperparameter meanings | Search space |
| --- | --- | --- |
| **Base hyperparameters** | | |
| DeepHit, DeepSurv, MTLR, Nnet-survival, CoxTime, DeepAFT-Weibull, DeepAFT-Loglogistic | `lr`: learning rate; `weight_decay`: weight decay; `neurons`: hidden architecture; `dropout`: dropout probability | {`lr`: $\{10^{-4},10^{-3},10^{-2}\}$; `weight_decay`: $\{10^{-3},10^{-2},10^{-1}\}$; `neurons`: $\{[64],[64,32],[64,64,16],[32],[32,16],[32,32,16],[16],[16,8],[8],[\,]\}$; `dropout`: $\{0.0,0.4,0.6\}$} |
| **Model-specific hyperparameters** | | |
| CQRNN | `n_quantiles`: number of predicted quantile levels. | {Base; `n_quantiles`: $\{9,19,39\}$} |
| SumoNet | `neurons_alter`: hidden architecture for the censoring branch. | {Base; `neurons_alter` = `neurons`} |
| IWSG | `neurons_alter`: hidden architecture for the censoring branch. | {Base; `neurons_alter` = `neurons`} |
| SurvivalMDN | `n_mixtures`: number of mixture components. | {Base; `n_mixtures`: $\{3,5,10,50\}$} |
| DSM | `n_mixtures`: number of mixture components. | {Base; `n_mixtures`: $\{3,5,10,50\}$} |
| BNN-MTLR | `pi`: prior mixture probability. | {Base; `pi`: $\{0.2,0.5,0.8\}$} |
### E.4 Evaluation Metrics

We evaluate each model's performance on the held-out test set $\mathcal{D}^{te}$. For a fitted model, let $\widehat{S}_{E\mid X}(u\mid x_{i})$ denote the predicted event-time survival function for subject $i$. When a point prediction is required, we use the predicted median survival time

$$\widehat{e}_{i} = \inf\!\left\{u:\widehat{S}_{E\mid X}(u\mid x_{i})\leq 1/2\right\},$$

and define the corresponding risk score as

$$\widehat{r}_{i} = -\widehat{e}_{i},$$

so that higher risk corresponds to shorter predicted survival.
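On a discrete evaluation grid, the median and the risk score can be read off directly, as in this sketch. The function names are hypothetical; treating the curve as evaluated only at grid times (no interpolation between them) and clamping to the last grid time when the curve never reaches $1/2$ are both assumptions.

```python
def median_survival_time(grid, surv):
    """Smallest grid time u with S(u) <= 1/2, clamped to the last grid time."""
    for u, s in zip(grid, surv):
        if s <= 0.5:
            return u
    return grid[-1]   # assumption: the curve never drops to 1/2 on this grid

def risk_score(grid, surv):
    """Negative median, so that shorter predicted survival means higher risk."""
    return -median_survival_time(grid, surv)

# Hypothetical survival curve for one test subject.
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
surv = [1.0, 0.9, 0.6, 0.4, 0.1]

median = median_survival_time(grid, surv)
risk = risk_score(grid, surv)
```

An implementation that interpolates between grid times would return a median between 2 and 3 for this curve rather than the grid value itself.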
#### Concordance Index\.
Discrimination measures whether a model correctly orders subjects by risk. We use Harrell's concordance index \[[33](https://arxiv.org/html/2605.15488#bib.bib115)\], computed over comparable pairs in which subject $i$ experiences an observed event before subject $j$:

$$\operatorname{CI} = \frac{\sum_{i=1}^{n}\sum_{j\neq i}\delta_{i}\,\mathbb{1}[t_{i}<t_{j}]\,\mathbb{1}\big[\widehat{r}_{i}>\widehat{r}_{j}\big]}{\sum_{i=1}^{n}\sum_{j\neq i}\delta_{i}\,\mathbb{1}[t_{i}<t_{j}]}.$$

Higher CI values indicate better performance.
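The formula translates directly into a double loop over ordered pairs. This is a plain $O(n^{2})$ sketch with a hypothetical function name; it follows the expression exactly as written, so pairs with tied risk scores earn no credit, and it assumes at least one comparable pair exists.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk."""
    concordant = 0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue                      # subject i must have an observed event
        for j in range(n):
            if j != i and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:   # earlier event should carry higher risk
                    concordant += 1
    return concordant / comparable

# Hypothetical test subjects; the second one is censored.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 0, 1, 1]
risks = [-2.5, -3.0, -7.0, -5.0]

ci = concordance_index(times, events, risks)
```

Here four pairs are comparable and three are ordered correctly: the subject with the event at $t=6$ is assigned lower risk than the subject with the event at $t=8$.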
#### Integrated Brier Score\.
The Brier score evaluates probabilistic accuracy at a target time $u$. Because the event status at $u$ may be unknown for subjects censored before $u$, we use inverse-probability-of-censoring weighting (IPCW) \[[26](https://arxiv.org/html/2605.15488#bib.bib131),[89](https://arxiv.org/html/2605.15488#bib.bib132)\]. Let $\widehat{S}_{C}$ denote the Kaplan-Meier estimate of the marginal censoring survival function, estimated from the training-side data. The IPCW Brier score is

$$\operatorname{BS}(u)\ =\ \frac{1}{n}\sum_{i=1}^{n}\left[\frac{\delta_{i}\,\mathbbm{1}[t_{i}\leq u]\cdot\widehat{S}_{E\mid X}(u\mid x_{i})^{2}}{\widehat{S}_{C}(t_{i})}+\frac{\mathbbm{1}[t_{i}>u]\left(1-\widehat{S}_{E\mid X}(u\mid x_{i})\right)^{2}}{\widehat{S}_{C}(u)}\right].$$

The integrated Brier score averages this quantity over an evaluation horizon $\tau$:

$$\operatorname{IBS}\ =\ \frac{1}{\tau}\int_{0}^{\tau}\operatorname{BS}(u)\,du. \tag{E.2}$$

In our experiments, $\tau$ is chosen from the training-side event-time support, so that the evaluation horizon is not determined by the held-out test set. Lower IBS indicates better performance.
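The following sketch shows one way to compute these quantities in plain Python. All names are hypothetical, the integral is approximated with the trapezoidal rule on a user-supplied grid starting at 0, and the Kaplan-Meier step function is evaluated at $t_i$ exactly as in the formula. It assumes the horizon is chosen where $\widehat{S}_{C}>0$; no weight clipping or special tie handling is included.

```python
from bisect import bisect_right


def km_censoring(times, delta):
    """Kaplan-Meier estimate of S_C, treating censoring (delta == 0) as the event."""
    xs, ss, s = [], [], 1.0
    for u in sorted(set(times)):
        at_risk = sum(1 for t in times if t >= u)
        censored = sum(1 for t, d in zip(times, delta) if t == u and d == 0)
        s *= 1.0 - censored / at_risk
        xs.append(u)
        ss.append(s)

    def S_C(u):
        k = bisect_right(xs, u)  # right-continuous step function
        return 1.0 if k == 0 else ss[k - 1]

    return S_C


def brier_score(u, t, delta, surv_u, S_C):
    """IPCW Brier score at time u; surv_u[i] is the predicted S(u | x_i)."""
    total = 0.0
    for ti, di, si in zip(t, delta, surv_u):
        if ti <= u and di == 1:
            total += si ** 2 / S_C(ti)
        elif ti > u:
            total += (1.0 - si) ** 2 / S_C(u)
        # subjects censored before u contribute zero
    return total / len(t)


def integrated_brier_score(grid, t, delta, surv, S_C):
    """Trapezoidal average of BS over grid[0] = 0 .. grid[-1] = tau; surv[i][k] = S(grid[k] | x_i)."""
    bs = [brier_score(u, t, delta, [row[k] for row in surv], S_C) for k, u in enumerate(grid)]
    area = sum(0.5 * (bs[k] + bs[k + 1]) * (grid[k + 1] - grid[k]) for k in range(len(grid) - 1))
    return area / (grid[-1] - grid[0])


S_C = km_censoring([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0])  # training-side data
bs_2 = brier_score(2.0, [1.0, 3.0], [1, 1], [0.2, 0.8], S_C)
ibs = integrated_brier_score([0.0, 2.0], [1.0, 3.0], [1, 1], [[1.0, 0.2], [1.0, 0.8]], S_C)
```

Estimating `S_C` from training-side data, as in the text, keeps the weights independent of the held-out test outcomes.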
#### Mean Absolute Error\.
To evaluate point prediction of event time, we use the mean absolute error based on pseudo-observations (MAE-PO) \[[79](https://arxiv.org/html/2605.15488#bib.bib133)\]. For uncensored subjects, the observed event time $t_{i}$ can be used directly. For censored subjects, the metric constructs a pseudo-observation $\widetilde{e}_{i}$ that estimates the subject’s contribution to the marginal Kaplan-Meier survival curve, and then weights the corresponding error by a confidence weight $w_{i}$. The resulting error takes the form

$$\operatorname{MAE\text{-}PO}\ =\ \frac{\sum_{i=1}^{n}w_{i}\,\left|\widehat{e}_{i}-\widetilde{e}_{i}\right|}{\sum_{i=1}^{n}w_{i}}.$$

This produces an event-time error metric that can use both uncensored and censored test subjects. A lower MAE-PO value indicates better performance.
#### Distribution Calibration\.
Distribution calibration (D-calibration) evaluates whether the predicted survival distribution is calibrated over the full time axis \[[29](https://arxiv.org/html/2605.15488#bib.bib134)\]. For an uncensored subject, define the probability integral transform value

$$u_{i}\ =\ \widehat{S}_{E\mid X}(t_{i}\mid x_{i}).$$

If the predicted survival distributions are calibrated, then $\{u_{i}:\delta_{i}=1\}$ should follow a standard uniform distribution on $[0,1]$. In practice, we partition the probability range $[0,1]$ into $10$ equal-width bins. Uncensored subjects contribute to the bin containing $\widehat{S}_{E\mid X}(t_{i}\mid x_{i})$. For censored subjects, the event time is only known to satisfy $e_{i}>t_{i}$, so their contribution is “blurred” over the portion of the probability scale consistent with this information, namely $\left[0,\widehat{S}_{E\mid X}(t_{i}\mid x_{i})\right]$. D-calibration is then assessed by the chi-square statistic computed over the bin counts. We report the chi-square statistic for each experiment; a smaller statistic indicates less evidence against distribution calibration.
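The binning step can be sketched as follows. The sketch assumes the blur for a censored subject is uniform over $[0,\widehat{S}_{E\mid X}(t_{i}\mid x_{i})]$, so each bin receives the fraction of that interval it covers; the function names are hypothetical and this is not the SurvivalEVAL code.

```python
def d_calibration_counts(surv_at_t, delta, n_bins=10):
    """Accumulate (possibly fractional) bin counts of S(t_i | x_i) on [0, 1]."""
    width = 1.0 / n_bins
    counts = [0.0] * n_bins
    for s, d in zip(surv_at_t, delta):
        k = min(int(s * n_bins), n_bins - 1)  # bin containing s
        if d == 1 or s <= 0.0:
            counts[k] += 1.0  # uncensored: one full count in its bin
        else:
            # censored: spread one unit of mass uniformly over [0, s]
            counts[k] += (s - k * width) / s
            for j in range(k):
                counts[j] += width / s
    return counts


def chi_square_statistic(counts):
    """Chi-square statistic of the bin counts against a uniform expectation."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Ten uncensored subjects, one per bin: perfectly uniform counts.
uniform_counts = d_calibration_counts([0.05 + 0.1 * k for k in range(10)], [1] * 10)
stat_uniform = chi_square_statistic(uniform_counts)

# One censored subject with S(t | x) = 0.25 is blurred over bins 0 to 2.
blurred = d_calibration_counts([0.25], [0])
```

Each subject, censored or not, contributes a total mass of one, so the expected count per bin is the number of subjects divided by the number of bins.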
#### Log\-rank Reliability Test\.
We also use a log-rank goodness-of-fit test to assess whether predicted event times are statistically aligned with the observed time-to-event data. The test compares the observed test sample

$$\mathcal{A}_{\mathrm{obs}}\ =\ \{(t_{i},\delta_{i})\}_{i=1}^{n}$$

with the predicted event-time sample

$$\mathcal{A}_{\mathrm{pred}}\ =\ \left\{\left(\widehat{e}_{i},\widehat{\delta}_{i}=1\right)\right\}_{i=1}^{n},$$

where predicted median survival times are treated as uncensored event-time predictions. We report the log-rank statistic for each experiment; a smaller statistic indicates closer agreement between the predicted event-time distribution and the observed test outcomes.

All metrics are computed using the SurvivalEVAL package \[[82](https://arxiv.org/html/2605.15488#bib.bib114)\].
### E.5 Compute Environment and Runtime Protocol
All benchmark experiments were executed on a Slurm-managed GPU cluster. Each experiment was submitted as an independent job with a fixed resource budget: one NVIDIA L40S GPU (48 GB memory), one CPU core of an Intel Xeon Gold 6448Y, 16 GB of system memory, and a maximum wall-clock time of 72 hours. The same resource allocation was used across models and datasets for fair evaluation.

The 72-hour wall-clock limit includes all computation required for a benchmark run, including model fitting, hyperparameter tuning when enabled, final refitting, prediction on the held-out test set, and metric computation. If a run did not complete successfully within this budget, we treated the corresponding model-dataset run as a failure. Failed runs were ranked worst (or tied for worst if multiple models failed on the same dataset) in the ranking procedure described in Appendix [E.6](https://arxiv.org/html/2605.15488#A5.SS6).
### E.6 Performance Reporting Protocol
We evaluate each model on repeated random train/validation/test splits. Each dataset is evaluated over $R=10$ repetitions. The experiment seeds for these repetitions are set to 0 through 9 and control the data split and, where applicable, model initialization.

For each split $r$, the dataset is divided into a 70% training/validation portion $\mathcal{D}^{tr+val}_{(r)}$ and a 30% test portion $\mathcal{D}^{te}_{(r)}$. We compute performance scores on the held-out test set via the metrics described in Appendix [E.4](https://arxiv.org/html/2605.15488#A5.SS4). We report the mean and standard deviation across the 10 splits.

For dataset-level comparisons, each model is ranked separately within each dataset and metric. Rank $1$ is assigned to the best value, with ties receiving the minimum tied rank. If a model does not produce a valid result for a given dataset-metric pair, it is assigned one rank below the worst completed method for that pair.
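A minimal sketch of this ranking rule is given below. The function name and the toy scores are hypothetical, and the handling of failures reflects one literal reading of the text (worst completed rank plus one); when the worst completed methods are tied, other readings are possible.

```python
def rank_models(scores, higher_is_better=False):
    """Map model -> rank for one dataset-metric pair; a score of None marks a failed run."""
    done = {m: v for m, v in scores.items() if v is not None}
    ranks = {}
    for m, v in done.items():
        if higher_is_better:
            strictly_better = sum(1 for w in done.values() if w > v)
        else:
            strictly_better = sum(1 for w in done.values() if w < v)
        ranks[m] = strictly_better + 1  # ties share the minimum tied rank
    worst_completed = max(ranks.values(), default=0)
    for m, v in scores.items():
        if v is None:
            # Assumption: failed runs sit one rank below the worst completed method.
            ranks[m] = worst_completed + 1
    return ranks


ibs_scores = {"A": 0.12, "B": 0.15, "C": 0.12, "D": None, "E": 0.20}  # lower is better
ranks = rank_models(ibs_scores)
```

Here models A and C tie for rank 1, B receives rank 3, E rank 4, and the failed model D rank 5.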
The cSVR baseline produces only a scalar event\-time prediction rather than a full survival distribution\. Consequently, distributional metrics such as IBS and D\-calibration are not applicable to cSVR on any dataset\.
Across the full benchmark, we evaluated $21$ models on $61$ datasets, yielding $21\times 61=1281$ model-dataset runs. Among these, $28$ runs failed to produce valid results, corresponding to a failure rate of $2.19\%$. We summarize the failures below.

- CoxPH failed on $13$ datasets: micro.censure, nki70, stagec, BMT, cancer, zinc, BCCardiotox, vlbw, PDM, actg, METABRIC, AIDS, and hdfail. In all cases, the failure was caused by a singular Hessian matrix of the Cox partial log-likelihood, which prevented the Newton-Raphson optimizer from computing a valid update. (Footnote 2: This issue can often be mitigated by adding regularization. However, because we include CoxNet as the regularized Cox baseline, we intentionally keep CoxPH as the conventional unregularized Cox model.)
- DeepSurv failed on $1$ dataset, PDM. The fitted baseline hazard produced a degenerate survival curve equal to $1$ at all time points, resulting in identical predictions for all test instances.
- CoxTime failed on $1$ dataset, FRTCS, because the predicted survival matrix contained NaN values.
- IWSG failed on $5$ datasets: SUPPORT, MIMIC-IV\_all, hdfail, SEER\_brain, and SEER\_liver. In all cases, the model did not complete training within the 72-hour wall-clock limit.
- DSM failed on $3$ datasets: FRTCS, NPC, and MIMIC-IV\_all. For FRTCS and NPC, the predicted survival matrix contained NaN values. For MIMIC-IV\_all, the run did not complete within the $72$-hour wall-clock limit.
- SurvivalMDN failed on $3$ datasets: SEER\_liver, SEER\_brain, and hdfail. On hdfail, the run exceeded the available $48$ GB GPU memory. On SEER\_liver and SEER\_brain, the model did not complete within the $72$-hour wall-clock limit.
- DeepAFT-Weibull failed on $1$ dataset, FRTCS, because the model did not converge during training.
- DeepAFT-Loglogistic failed on $1$ dataset, FRTCS, because the model did not converge during training.
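The failure counts listed above can be tallied to check the reported totals. The dictionary below simply restates the per-model counts from the list; the variable names are illustrative.

```python
# Per-model failure counts as listed above.
failures = {
    "CoxPH": 13,
    "DeepSurv": 1,
    "CoxTime": 1,
    "IWSG": 5,
    "DSM": 3,
    "SurvivalMDN": 3,
    "DeepAFT-Weibull": 1,
    "DeepAFT-Loglogistic": 1,
}

total_runs = 21 * 61                      # model-dataset runs
n_failed = sum(failures.values())         # failed runs
failure_rate_pct = 100.0 * n_failed / total_runs
```

The tally gives 28 failed runs out of 1281, which rounds to the stated 2.19%.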
Let $\operatorname{rank}_{m,\mathcal{D},q}$ denote the rank of model $m$ on dataset $\mathcal{D}$ for metric $q$. For each model and metric, we summarize performance by the median rank across datasets,

$$\widetilde{r}_{m,q}\ =\ \operatorname{median}_{\mathcal{D}}\operatorname{rank}_{m,\mathcal{D},q}.$$

To quantify uncertainty in this median rank, we use a nonparametric bootstrap over datasets. Specifically, for each bootstrap replicate $b=1,\ldots,B$, we sample datasets with replacement from the benchmark collection, recompute the median rank,

$$\widetilde{r}^{(b)}_{m,q}\ =\ \operatorname{median}_{\mathcal{D}\in\mathcal{B}^{(b)}}\operatorname{rank}_{m,\mathcal{D},q},$$

and report the 95% bootstrap confidence interval as

$$\left[Q_{0.025}\!\left(\{\widetilde{r}^{(b)}_{m,q}\}_{b=1}^{B}\right),\ Q_{0.975}\!\left(\{\widetilde{r}^{(b)}_{m,q}\}_{b=1}^{B}\right)\right].$$

The overall rank is computed by pooling ranks over all evaluation metrics before applying the same median-rank and bootstrap procedure.
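The bootstrap procedure can be sketched with the standard library alone. The number of replicates `n_boot=1000`, the seed, the linear-interpolation quantile, and all function names are assumptions made for illustration; the text does not state the value of $B$.

```python
import random
import statistics


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def bootstrap_median_rank(ranks, n_boot=1000, seed=0):
    """Median rank across datasets and a 95% percentile bootstrap interval."""
    rng = random.Random(seed)
    point = statistics.median(ranks)
    replicates = []
    for _ in range(n_boot):
        resample = [rng.choice(ranks) for _ in ranks]  # datasets drawn with replacement
        replicates.append(statistics.median(resample))
    return point, (quantile(replicates, 0.025), quantile(replicates, 0.975))


# Hypothetical ranks of one model on 11 datasets for one metric.
ranks_across_datasets = [1, 2, 2, 3, 1, 4, 2, 5, 1, 3, 2]
median_rank, (ci_low, ci_high) = bootstrap_median_rank(ranks_across_datasets)
```

Because the resampling unit is the dataset, the interval describes uncertainty in the median rank over the benchmark collection rather than the spread of individual ranks.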
## Appendix F Benchmark Dataset Details
We construct a large collection of real-world survival datasets from two sources. First, we use datasets from the SurvSet package \[[15](https://arxiv.org/html/2605.15488#bib.bib110)\]. We exclude datasets that are longitudinal, contain competing risks rather than a single event of interest, or are high-dimensional in the sense that the number of features exceeds the number of samples. After applying these criteria, 57 SurvSet datasets remain. Second, we collect 24 additional survival datasets from textbooks, recent papers, and public software packages, restricting attention to datasets with at most 100,000 samples. This yields a total of 81 real-world datasets, which, to our knowledge, constitutes one of the largest survival-analysis benchmark collections considered in a single study. Summary statistics for all datasets are provided in Table 4.
Among the 81 datasets, we designate 20 datasets as validation datasets for selecting the SurvivalPFN checkpoint during training. (Footnote 3: SurvivalPFN is not trained or fine-tuned on these validation datasets, nor on any other real-world dataset; they are used only for checkpoint selection.) The remaining 61 datasets are held out for the final benchmark and are not used for checkpoint selection. For each validation dataset, we use a deterministic split with 70% of samples as the in-context training set and 30% as the inference set. At each training epoch, the current SurvivalPFN checkpoint is evaluated on all 20 validation datasets. We select the checkpoint with the best weighted average integrated Brier score,

$$\begin{aligned}\operatorname{Weighted-IBS}(\theta)\ &=\ \frac{\sum_{\mathcal{D}}\sqrt{N}\,\operatorname{IBS}_{\mathcal{D}}(\theta)}{\sum_{\mathcal{D}}\sqrt{N}},\\ \theta^{\ast}\ &=\ \arg\min_{\theta}\operatorname{Weighted-IBS}(\theta),\end{aligned}$$

where $\operatorname{IBS}_{\mathcal{D}}(\theta)$ is the IBS of checkpoint $\theta$ on the inference split of dataset $\mathcal{D}$, calculated using Equation [E.2](https://arxiv.org/html/2605.15488#A5.E2), and $N$ is the number of samples in dataset $\mathcal{D}$. The square-root weighting gives larger datasets more influence while avoiding domination by the largest cohorts.
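A short sketch of this selection rule is shown below. The checkpoint names, dataset names, and scores are hypothetical, and $N$ is taken to be the full dataset size, which is an assumption about how the weight is computed.

```python
import math


def weighted_ibs(ibs_by_dataset, size_by_dataset):
    """Square-root-size weighted average of per-dataset IBS values."""
    num = sum(math.sqrt(size_by_dataset[d]) * v for d, v in ibs_by_dataset.items())
    den = sum(math.sqrt(size_by_dataset[d]) for d in ibs_by_dataset)
    return num / den


def select_checkpoint(ibs_per_checkpoint, size_by_dataset):
    """Return the checkpoint name with the smallest weighted IBS."""
    return min(ibs_per_checkpoint, key=lambda c: weighted_ibs(ibs_per_checkpoint[c], size_by_dataset))


sizes = {"small": 100, "large": 10000}  # hypothetical validation datasets
checkpoints = {
    "epoch_1": {"small": 0.10, "large": 0.20},
    "epoch_2": {"small": 0.20, "large": 0.12},
}
best = select_checkpoint(checkpoints, sizes)
score_epoch_1 = weighted_ibs(checkpoints["epoch_1"], sizes)
```

With sizes 100 and 10,000 the weights are 10 and 100, so the larger dataset counts ten times as much rather than one hundred times as much, which is the damping effect described in the text.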
The validation datasets were chosen to cover a broad range of empirical regimes. In particular, we considered sample size, censoring rate, and the tail survival probability estimated by the Kaplan-Meier curve, $\widehat{S}_{\mathrm{KM}}(t_{\max})$.
Table 4: Summary of survival datasets used for checkpoint selection and held-out benchmarking. Features (cat. feat.) reports the total number of covariates, with the number of categorical covariates in parentheses.

| Dataset | Number of samples | Features (cat. feat.) | Missing features | Missing rate | Censoring rate | $\widehat{S}(t_{\max})$ |
|---|---|---|---|---|---|---|
| Validation datasets for checkpoint selection | | | | | | |
| ovarian | 26 | 4 (4) | 0 | 0.0% | 53.8% | 49.7% |
| glioma | 37 | 4 (4) | 0 | 0.0% | 37.8% | 34.9% |
| leukemia | 42 | 3 (2) | 0 | 0.0% | 28.6% | 18.9% |
| pharmacoSmoking | 125 | 16 (16) | 0 | 0.0% | 28.8% | 28.8% |
| d.oropha.rec | 192 | 24 (24) | 0 | 0.0% | 27.6% | 16.5% |
| Pbc3 | 349 | 19 (19) | 4 | 0.4% | 82.5% | 63.4% |
| retinopathy | 394 | 11 (11) | 0 | 0.0% | 60.7% | 53.1% |
| Rossi | 432 | 7 (5) | 0 | 0.0% | 73.6% | 73.6% |
| phpl04K8a | 442 | 21 (21) | 0 | 0.0% | 46.6% | 0.0% |
| prostate | 502 | 25 (25) | 4 | 0.2% | 29.5% | 23.8% |
| uis | 628 | 14 (14) | 3 | 0.6% | 19.1% | 16.6% |
| grace | 1,000 | 5 (5) | 0 | 0.0% | 67.6% | 63.5% |
| rdata | 1,040 | 6 (6) | 0 | 0.0% | 47.4% | 38.1% |
| TRACE | 1,878 | 6 (6) | 0 | 0.0% | 49.0% | 43.9% |
| Aids2 | 2,839 | 12 (12) | 0 | 0.0% | 38.0% | 5.8% |
| UnempDur | 3,241 | 6 (6) | 0 | 0.0% | 38.7% | 10.5% |
| smarto | 3,873 | 34 (34) | 16 | 4.8% | 88.1% | 72.5% |
| dataDIVAT1 | 5,943 | 16 (16) | 0 | 0.0% | 83.6% | 0.0% |
| oldmort | 6,495 | 14 (14) | 0 | 0.0% | 69.7% | 0.0% |
| prostateSurvival | 14,294 | 6 (6) | 0 | 0.0% | 94.4% | 81.4% |
| Held-out test datasets for benchmarking | | | | | | |
| Bergamaschi | 82 | 10 (10) | 0 | 0.0% | 65.9% | 58.3% |
| larynx | 90 | 4 (3) | 0 | 0.0% | 44.4% | 29.7% |
| lupus | 95 | 5 (2) | 1 | 0.2% | 70.5% | 36.8% |
| micro.censure | 117 | 81 (81) | 0 | 0.0% | 77.8% | 42.4% |
| cgd | 128 | 23 (23) | 0 | 0.0% | 65.6% | 47.6% |
| veteran | 137 | 8 (8) | 0 | 0.0% | 6.6% | 0.0% |
| nki70 | 144 | 76 (76) | 0 | 0.0% | 66.7% | 48.0% |
| stagec | 146 | 18 (18) | 2 | 0.4% | 63.0% | 55.8% |
| burn | 154 | 13 (13) | 0 | 0.0% | 68.8% | 48.0% |
| BMT | 187 | 39 (24) | 11 | 1.1% | 54.5% | 53.3% |
| WPBC | 198 | 32 (1) | 1 | 0.1% | 76.3% | 65.3% |
| Melanoma | 205 | 5 (5) | 0 | 0.0% | 72.2% | 64.5% |
| hepatoCellular | 227 | 43 (43) | 26 | 33.6% | 57.3% | 51.2% |
| cancer | 228 | 28 (28) | 4 | 1.0% | 27.6% | 5.0% |
| NCCTG | 228 | 8 (1) | 6 | 3.7% | 27.6% | 5.0% |
| mgus | 241 | 12 (12) | 4 | 8.7% | 6.6% | 1.8% |
| e1684 | 284 | 3 (3) | 0 | 0.0% | 31.0% | 28.3% |
| HFCR | 299 | 11 (5) | 0 | 0.0% | 67.9% | 57.6% |
| ova | 358 | 11 (11) | 0 | 0.0% | 25.7% | 22.2% |
| diabetes | 394 | 4 (4) | 0 | 0.0% | 60.7% | 53.0% |
| PBC | 418 | 17 (7) | 12 | 14.5% | 61.5% | 35.3% |
| zinc | 431 | 20 (20) | 0 | 0.0% | 81.2% | 79.0% |
| Unemployment | 452 | 7 (7) | 0 | 0.0% | 43.4% | 5.8% |
| whas500 | 461 | 17 (17) | 0 | 0.0% | 61.8% | 0.0% |
| cost | 518 | 13 (13) | 0 | 0.0% | 22.0% | 22.0% |
| BCCardiotox | 531 | 25 (15) | 22 | 5.7% | 89.8% | 69.0% |
| GBM | 591 | 10 (7) | 3 | 4.5% | 17.1% | 0.0% |
| vlbw | 617 | 41 (41) | 5 | 1.7% | 82.7% | 0.0% |
| GBSG2 | 686 | 8 (2) | 0 | 0.0% | 56.4% | 34.3% |
| FRTCS | 697 | 13 (13) | 2 | 0.1% | 89.7% | 57.2% |
| kidney\_transplant | 863 | 4 (3) | 0 | 0.0% | 83.8% | 72.4% |
| dataOvarian1 | 912 | 162 (162) | 0 | 0.0% | 40.4% | 11.9% |
| colon | 929 | 12 (12) | 1 | 0.2% | 51.3% | 45.5% |
| credit | 1,000 | 31 (20) | 1 | 0.6% | 30.0% | 5.9% |
| PDM | 1,000 | 8 (5) | 0 | 0.0% | 60.3% | 0.0% |
| LeukSurv | 1,043 | 29 (29) | 0 | 0.0% | 15.7% | 4.4% |
| actg | 1,151 | 17 (17) | 0 | 0.0% | 91.7% | 90.1% |
| COVID | 1,422 | 9 (2) | 0 | 0.0% | 90.5% | 1.1% |
| WHAS | 1,638 | 6 (4) | 0 | 0.0% | 57.9% | 35.7% |
| dataDIVAT2 | 1,837 | 4 (4) | 0 | 0.0% | 68.3% | 31.1% |
| scania | 1,931 | 8 (8) | 0 | 0.0% | 43.8% | 15.9% |
| churn | 1,958 | 19 (10) | 0 | 0.0% | 52.4% | 24.4% |
| METABRIC | 1,981 | 79 (73) | 0 | 0.0% | 55.2% | 11.6% |
| AIDS | 2,139 | 22 (14) | 0 | 0.0% | 24.4% | 0.0% |
| NACD | 2,396 | 48 (33) | 0 | 0.0% | 36.4% | 12.5% |
| rott2 | 2,982 | 12 (12) | 0 | 0.0% | 57.3% | 26.5% |
| divorce | 3,371 | 4 (4) | 0 | 0.0% | 69.4% | 55.7% |
| acath | 3,504 | 3 (3) | 1 | 11.9% | 33.4% | 0.0% |
| NWTCO | 4,028 | 6 (5) | 0 | 0.0% | 85.8% | 84.9% |
| dataDIVAT3 | 4,267 | 16 (16) | 0 | 0.0% | 94.4% | 84.7% |
| Framingham | 4,699 | 17 (17) | 2 | 0.1% | 68.7% | 60.5% |
| NPC | 6,449 | 9 (3) | 0 | 0.0% | 80.8% | 75.6% |
| Dialysis | 6,805 | 72 (72) | 0 | 0.0% | 76.4% | 58.7% |
| FLCHAIN | 7,871 | 23 (2) | 1 | 0.7% | 72.5% | 68.2% |
| MSKCC | 8,130 | 206 (199) | 2 | 0.0% | 70.3% | 34.8% |
| SUPPORT | 9,105 | 31 (11) | 10 | 4.0% | 31.9% | 24.1% |
| employee | 11,991 | 16 (9) | 0 | 0.0% | 83.4% | 50.8% |
| MIMIC-IV\_all | 38,520 | 91 (6) | 0 | 0.0% | 66.7% | 0.0% |
| hdfail | 52,422 | 88 (88) | 0 | 0.0% | 94.5% | 56.1% |
| SEER\_brain | 73,703 | 10 (4) | 0 | 0.0% | 40.1% | 26.6% |
| SEER\_liver | 82,841 | 14 (1) | 0 | 0.0% | 37.6% | 18.0% |

We briefly describe below the 24 datasets that are not taken from SurvSet. For the SurvSet datasets, please refer to Drysdale \[[15](https://arxiv.org/html/2605.15488#bib.bib110)\].
- leukemia: The leukemia dataset records remission survival for 42 patients with sex, treatment assignment, and log white blood cell count \[[50](https://arxiv.org/html/2605.15488#bib.bib136)\].
- larynx: The larynx cancer dataset follows 90 male patients from first treatment until death or study end, with disease stage, age, and diagnosis year \[[44](https://arxiv.org/html/2605.15488#bib.bib137)\].
- lupus: The lupus dataset records survival times and diagnostic features for systemic lupus erythematosus patients \[[63](https://arxiv.org/html/2605.15488#bib.bib138)\].
- BMT: The Bone Marrow Transplant (BMT) Children dataset describes pediatric patients with malignant and nonmalignant hematologic diseases who underwent unrelated-donor allogeneic hematopoietic stem cell transplantation \[[93](https://arxiv.org/html/2605.15488#bib.bib139)\].
- NCCTG: The NCCTG lung cancer dataset follows advanced lung cancer patients with physician-rated and patient-rated performance scores \[[58](https://arxiv.org/html/2605.15488#bib.bib140)\].
- HFCR: The Heart Failure Clinical Records (HFCR) dataset contains follow-up outcomes and clinical measurements for 299 heart failure patients \[[7](https://arxiv.org/html/2605.15488#bib.bib141)\].
- Rossi: The Rossi dataset follows 432 Maryland prison releasees for one year to study recidivism after financial-aid treatment assignment \[[90](https://arxiv.org/html/2605.15488#bib.bib142)\].
- BCCardiotox: BCCardiotox follows HER2+ breast cancer patients treated with potentially cardiotoxic therapies and records time to cancer therapy-related cardiac dysfunction \[[76](https://arxiv.org/html/2605.15488#bib.bib143)\].
- GBM: The TCGA glioblastoma multiforme (GBM) dataset contains clinical survival information for primary brain tumor patients \[[95](https://arxiv.org/html/2605.15488#bib.bib144)\].
- kidney\_transplant: The kidney\_transplant dataset records post-transplant death times with recipient age, gender, and race \[[49](https://arxiv.org/html/2605.15488#bib.bib145)\].
- credit: The PySurvival \[[18](https://arxiv.org/html/2605.15488#bib.bib146)\] credit-risk dataset adapts the German Credit data to model the time until a loan is fully repaid.
- PDM: The PySurvival \[[18](https://arxiv.org/html/2605.15488#bib.bib146)\] predictive-maintenance (PDM) dataset models the time until an industrial machine breaks.
- COVID: The COVID-19 Asian discharge dataset models time to hospital discharge for patients with COVID-19 diagnosis \[[52](https://arxiv.org/html/2605.15488#bib.bib4)\].
- WHAS: The Worcester Heart Attack Study (WHAS) follows acute myocardial infarction patients after hospital admission \[[39](https://arxiv.org/html/2605.15488#bib.bib147)\].
- churn: The PySurvival \[[18](https://arxiv.org/html/2605.15488#bib.bib146)\] churn dataset models when SaaS customers stop their monthly subscription.
- METABRIC: METABRIC profiles primary breast tumors with long-term clinical follow-up for breast-cancer prognosis \[[12](https://arxiv.org/html/2605.15488#bib.bib62)\].
- AIDS: ACTG 175 records HIV-infected adults randomized to nucleoside monotherapy or combination therapy \[[30](https://arxiv.org/html/2605.15488#bib.bib148)\].
- NACD: The Northern Alberta Cancer Dataset contains clinical records for several cancer sites, which are used to predict the time from cancer onset to death \[[103](https://arxiv.org/html/2605.15488#bib.bib82)\].
- NPC: The nasopharyngeal carcinoma dataset follows patients from Sun Yat-sen University Cancer Center for progression-free survival after radiotherapy with or without chemotherapy \[[96](https://arxiv.org/html/2605.15488#bib.bib149)\]. Original training and validation cohorts are combined.
- MSKCC: The MSK-IMPACT cohort contains targeted sequencing and clinical annotations from more than 10,000 advanced cancer cases \[[104](https://arxiv.org/html/2605.15488#bib.bib150)\].
- MIMIC-IV\_all: MIMIC-IV\_all is derived from the MIMIC-IV critical care database, which contains de-identified electronic health records for patients admitted to intensive care units \[[42](https://arxiv.org/html/2605.15488#bib.bib152)\]. We use the all-cause mortality cohort curated by Qi et al. \[[79](https://arxiv.org/html/2605.15488#bib.bib133)\], restricting to patients who survived at least 24 hours after ICU admission.
- employee: The PySurvival \[[18](https://arxiv.org/html/2605.15488#bib.bib146)\] employee-retention dataset models when employees leave a company using HR and workload attributes.
- SEER\_brain: SEER\_brain is a brain cancer cohort derived from the Surveillance, Epidemiology, and End Results (SEER) Program registry. The task is to predict time from cancer diagnosis to death or censoring, using the cohort curated by Qi et al. \[[83](https://arxiv.org/html/2605.15488#bib.bib151)\].
- SEER\_liver: SEER\_liver is the corresponding liver cancer cohort from SEER, with the same time-to-death prediction target and curation protocol \[[83](https://arxiv.org/html/2605.15488#bib.bib151)\].
## Appendix G Additional Experimental Results
### G.1 RQ1: Predictive Performance
Figure [6](https://arxiv.org/html/2605.15488#S4.F6) shows the overall performance across the 61 benchmark datasets. The intervals reflect uncertainty in the estimated median rank across the benchmark collection, rather than the variability of the ranks themselves. In this appendix, we further analyze the benchmark results by stratifying datasets according to sample size and censoring rate.
The sample-size stratification reveals that SurvivalPFN is particularly effective in the small-data regime. On small datasets (Figure [11](https://arxiv.org/html/2605.15488#A7.F11)), SurvivalPFN is clearly separated from most baselines in overall rank and performs strongly across all five metrics: IBS, CI, D-calibration, MAE, and Log-Rank. Traditional statistical survival methods and tree-based methods perform in the second tier, while neural-network-based methods generally perform worst.

Figure 11: Model ranks across 24 small-size datasets ($N<500$). Points/stars denote median ranks across datasets, with horizontal bars showing 95% bootstrap confidence intervals for the median rank.

For medium-sized datasets (Figure [12](https://arxiv.org/html/2605.15488#A7.F12)), SurvivalPFN remains among the strongest methods overall, with especially competitive performance on IBS and Log-Rank. However, the gap to the best baselines becomes smaller, and several neural-network-based methods (e.g., MTLR, SurvivalMDN) become competitive in overall performance and also on individual metrics such as CI or D-calibration.

On large datasets (Figure [13](https://arxiv.org/html/2605.15488#A7.F13)), the advantage of SurvivalPFN decreases further: its overall, IBS, and CI ranks move leftward compared with the small-data setting, although it remains competitive on MAE and Log-Rank. This trend suggests a size-regime trade-off. As more observations become available, deep-learning-based estimators can benefit more directly from the larger sample size, whereas SurvivalPFN is constrained by finite context length and the need to summarize large tables during inference.

Figure 12: Model ranks across 27 medium-size datasets ($500\leq N<5000$). Points/stars denote median ranks across datasets, with horizontal bars showing 95% bootstrap confidence intervals for the median rank.

Figure 13: Model ranks across 10 large-size datasets ($N\geq 5000$). Points/stars denote median ranks across datasets, with horizontal bars showing 95% bootstrap confidence intervals for the median rank.

In contrast, the censoring-rate stratification shows substantially more stable behavior. Across low-, medium-, and high-censoring subsets (Figures [14](https://arxiv.org/html/2605.15488#A7.F14)-[16](https://arxiv.org/html/2605.15488#A7.F16)), SurvivalPFN remains in the top group overall and retains strong performance on IBS, MAE, and Log-Rank. This consistency is encouraging because censoring affects the available event-time information and can differentially impact discrimination, calibration, and distributional accuracy metrics. The results suggest that the synthetic prior used to train SurvivalPFN provides useful robustness across censoring regimes. Although the strongest baseline varies across metrics and censoring levels, no single baseline matches SurvivalPFN’s overall consistency across dataset strata, censoring strata, and evaluation metrics.

Figure 14: Model ranks across 12 low-censoring-rate datasets (censoring rate $<33\%$). Points/stars denote median ranks across datasets, with horizontal bars showing 95% bootstrap confidence intervals for the median rank.

Figure 15: Model ranks across 25 medium-censoring-rate datasets (censoring rate $\geq 33\%$ and $<67\%$). Points/stars denote median ranks across datasets, with horizontal bars showing 95% bootstrap confidence intervals for the median rank.

Figure 16: Model ranks across 24 high-censoring-rate datasets (censoring rate $\geq 67\%$). Points/stars denote median ranks across datasets, with horizontal bars showing 95% bootstrap confidence intervals for the median rank.
### G.2 RQ2: Computational Efficiency
For the runtime comparison in Figure [1](https://arxiv.org/html/2605.15488#S1.F1), datasets with at least one failed run are excluded from the aggregation, so that every method is compared on the same set of completed datasets. This prevents methods from appearing artificially faster because they crashed, timed out, or failed to produce valid predictions on more demanding datasets. Predictive-performance rankings still follow the failure-handling protocol in Appendix [E.6](https://arxiv.org/html/2605.15488#A5.SS6).
### G.3 RQ3: Sensitivity to Training-Set Size
Figure[17](https://arxiv.org/html/2605.15488#A7.F17)extends the training\-ratio analysis to 16 additional datasets\. Across these datasets, increasing the training ratio improves all methods, with IBS decreasing and CI increasing most sharply when moving from very small context sizes to moderate context sizes\. SurvivalPFN remains stable across ratios and is frequently among the best methods on both metrics, especially in low\-data regimes\.
Figure 17: Sensitivity to the training/context ratio across 16 selected datasets. Panels: (a) larynx, (b) nki70, (c) WPBC, (d) hepatoCellular, (e) NCCTG, (f) ova, (g) diabetes, (h) zinc, (i) GBM, (j) dataDIVAT2, (k) churn, (l) acath.
### G.4 RQ4: Comparison with General Tabular Foundation Models
This section compares SurvivalPFN and StaticSurvivalTFM with general tabular foundation models (TFMs). Specifically, we include the most advanced TFMs: TabPFN v2.5 \[[27](https://arxiv.org/html/2605.15488#bib.bib158)\], TabICL v2 \[[84](https://arxiv.org/html/2605.15488#bib.bib154)\], MITRA \[[106](https://arxiv.org/html/2605.15488#bib.bib159)\], and TabDPT \[[61](https://arxiv.org/html/2605.15488#bib.bib155)\] regressors.

For these TFM regression baselines, we train only on uncensored training examples because these models cannot natively handle right-censored data. TabPFN and TabICL are used as quantile regressors: they predict event-time quantiles, which are monotonized and converted into survival curves for distributional evaluation. MITRA and TabDPT are used as point regressors: they predict a single event time per test subject (just like cSVR). Since point predictions do not define a full survival distribution, distributional metrics such as IBS and D-calibration are not computed for these models, which are ranked worst among all methods on those metrics, while CI, MAE, and log-rank style comparisons are computed from the predicted event times.

StaticSurvivalTFM \[[46](https://arxiv.org/html/2605.15488#bib.bib130)\] is a static formulation that can convert any classifier into a survival predictor. We instantiate this static formulation with TabDPT and MITRA classifier backbones, predict failure probabilities over the cutoff grid, convert them to survival probabilities, and enforce monotone survival curves. We choose TabDPT to match the model architecture of SurvivalPFN (for a fair comparison). We include MITRA because it is the best-performing backbone described in Kim et al. \[[46](https://arxiv.org/html/2605.15488#bib.bib130)\].
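The conversion from per-cutoff failure probabilities to a monotone survival curve can be sketched as below. The running-minimum projection is an assumption chosen for illustration; the source does not specify how monotonicity is enforced, and the function name and inputs are hypothetical.

```python
def survival_from_failure_probs(fail_probs):
    """Convert P(T <= c_k) on an increasing cutoff grid into a non-increasing survival curve."""
    surv = []
    running = 1.0
    for p in fail_probs:
        # S(c_k) = 1 - P(T <= c_k), then clip so the curve never increases (assumed rule).
        running = min(running, 1.0 - p)
        surv.append(running)
    return surv


# Hypothetical classifier outputs over four cutoffs; the third violates monotonicity.
failure_probs = [0.10, 0.30, 0.25, 0.60]
survival_curve = survival_from_failure_probs(failure_probs)
```

In the example the third cutoff would give a survival probability of 0.75, above the preceding 0.70, so the running minimum holds the curve at 0.70 before it drops to 0.40.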
The results presented here use the same evaluation protocol as described in Appendix [G.1](https://arxiv.org/html/2605.15488#A7.SS1). Figure [18](https://arxiv.org/html/2605.15488#A7.F18) expands the comparison in Figure [8](https://arxiv.org/html/2605.15488#S4.F8) by including both general TFMs and the StaticSurvivalTFM wrapper instantiated with TabDPT and MITRA. Overall, SurvivalPFN remains the strongest and most consistent method: it achieves the best aggregate rank and ranks first or near-first across nearly all metrics. The largest gains appear for IBS and D-calibration, where SurvivalPFN clearly outperforms both direct TFM baselines and StaticSurvivalTFM variants, indicating better probabilistic survival estimation and calibration. SurvivalPFN also performs best on CI and log-rank, showing that its advantage extends beyond distributional accuracy to risk ranking and group separation.

The performance of StaticSurvivalTFM is highly sensitive to the backbone TFM, which aligns with the findings in \[[46](https://arxiv.org/html/2605.15488#bib.bib130)\]. Using MITRA as the backbone improves over using TabDPT, especially for CI and log-rank, confirming that survival-specific label construction is helpful. However, the performance of both variants is less stable across metrics: StaticSurvivalTFM (MITRA) is competitive on MAE and CI, but does not match SurvivalPFN on IBS, D-calibration, and log-rank; StaticSurvivalTFM (TabDPT) performs well on log-rank but is weaker on the other metrics. In contrast, the direct regression baselines (TabPFN, TabICL, MITRA, and TabDPT) are consistently worse, despite being strong general tabular predictors.

These results support the main conclusion of RQ4: survival prediction benefits from a foundation model trained with survival-specific supervision and censoring-aware synthetic tasks, rather than relying only on generic tabular in-context learning.

Figure 18: Comparison of SurvivalPFN with general TFMs across 61 benchmark datasets. Plotting conventions follow Figure [6](https://arxiv.org/html/2605.15488#S4.F6).
### G.5 RQ5: Ablation Studies
Table 5: SurvivalPFN ablation configurations. Each row corresponds to one pretrained SurvivalPFN variant. The internal checkpoint-path column from the experiment log is omitted. “Surv.-dist.” denotes the survival-distribution prior. NLL denotes the one-hot discrete negative log-likelihood over transformed-time bins; CE denotes the smoothed histogram cross-entropy loss.

| Model | Predictive Pretrain | Prior | Time Transformation | Loss | Query Schedule | Variable Train Ratio |
|---|---|---|---|---|---|---|
| v01† | ✓ | Surv.-dist. | lognormal2normal | NLL | Both | ✗ |
| v02 | ✓ | Surv.-dist. | time2quantile | NLL | Both | ✗ |
| v03 | ✗ | Surv.-dist. | time2quantile | CE | Random | ✓ |
| v04 | ✓ | Surv.-dist. | time2quantile | CE | Random | ✓ |
| v05 | ✓ | Surv.-dist. | time2quantile | NLL | Random | ✓ |
| v06 | ✓ | Surv.-dist. | lognormal2normal | NLL | Random | ✓ |
| v07 | ✓ | Kitchen-sink | lognormal2normal | CE | Random | ✓ |
| v08 | ✓ | Kitchen-sink | lognormal2normal | NLL | Event-only | ✓ |
| v09 | ✓ | Surv.-dist. | lognormal2normal | CE | Random | ✓ |
| v10 | ✓ | Surv.-dist. | lognormal2normal | NLL | Event-only | ✓ |
| v11 | ✓ | Naive | lognormal2normal | CE | Random | ✓ |
| v12 | ✓ | Naive | lognormal2normal | NLL | Event-only | ✓ |
| v13 | ✗ | Naive | lognormal2normal | CE | Random | ✓ |
| v14 | ✗ | Naive | lognormal2normal | NLL | Event-only | ✓ |
†Best validation run; this checkpoint is used as the default SurvivalPFN model elsewhere in the paper\.
#### SurvivalPFN Ablation Configurations\.
Table [5](https://arxiv.org/html/2605.15488#A7.T5) summarizes the SurvivalPFN variants used in the ablation study\. Each row corresponds to one pretrained checkpoint and is defined by six configuration choices\.
Predictive Pretraining specifies whether the model is initialized from the predictive PFN\-style pretraining phase before survival\-specific training\. A value of ✓ means that the model first undergoes the general predictive pretraining stage described in Appendix [D\.2](https://arxiv.org/html/2605.15488#A4.SS2), and is then further trained with the survival phase\. A value of ✗ means that survival\-phase training starts without this predictive initialization\. This ablation tests whether generic PFN\-style in\-context predictive pretraining improves downstream survival prediction\.
Prior specifies the synthetic survival prior used to generate the right\-censored pretraining tasks, as described in Appendix [D\.1](https://arxiv.org/html/2605.15488#A4.SS1)\. We consider four possible prior families: the *naive prior*, the *survival\-distribution prior*, the *mixture prior*, and the *kitchen\-sink prior*\.
Time transformation specifies the monotone transformation applied to event and censoring times before discretization, as described in Appendix [D\.3](https://arxiv.org/html/2605.15488#A4.SS3)\. We try the lognormal2normal and the time2quantile transformations\.
Loss specifies how the discretized predictive distribution is trained in transformed\-time space, as described in Appendix [D\.4](https://arxiv.org/html/2605.15488#A4.SS4)\. The *NLL* setting uses a one\-hot discrete negative log\-likelihood, assigning all target mass to the bin containing the latent target time\. The *CE* setting uses the smoothed histogram cross\-entropy loss, where the latent target time is converted into a narrow Gaussian\-smoothed histogram over bins\.
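The difference between the two targets can be illustrated with a minimal sketch\. It is not the authors' implementation: the bin edges, the smoothing width `sigma`, and all function names are hypothetical, and the Gaussian mass is simply renormalized over the available bins\.

```python
import bisect
import math


def target_bin_index(bin_edges, target_time):
    """Index of the bin containing target_time, clamped to the valid range."""
    idx = bisect.bisect_right(bin_edges, target_time) - 1
    return min(max(idx, 0), len(bin_edges) - 2)


def one_hot_nll(log_probs, target_bin):
    """One-hot NLL: all target mass sits on the bin containing the target time."""
    return -log_probs[target_bin]


def smoothed_histogram_target(bin_edges, target_time, sigma):
    """Mass of N(target_time, sigma^2) falling in each bin, renormalized to sum to 1.

    Assumes target_time lies close enough to the grid that the total mass is nonzero.
    """
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - target_time) / (sigma * math.sqrt(2.0))))

    masses = [cdf(hi) - cdf(lo) for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
    total = sum(masses)
    return [m / total for m in masses]


def smoothed_ce(log_probs, target_hist):
    """Cross-entropy between the smoothed target histogram and the prediction."""
    return -sum(q * lp for q, lp in zip(target_hist, log_probs))


# Hypothetical example: four unit-width bins and a uniform predicted distribution.
edges = [0.0, 1.0, 2.0, 3.0, 4.0]
log_probs = [math.log(0.25)] * 4
nll = one_hot_nll(log_probs, target_bin_index(edges, 1.5))
ce = smoothed_ce(log_probs, smoothed_histogram_target(edges, 1.5, sigma=0.5))
```

As `sigma` shrinks, the smoothed histogram concentrates on the bin containing the target time, so the CE target approaches the one\-hot NLL target\.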
Query schedule specifies how the query indicator $\widetilde{\delta}^{\ast}$ is selected during training, following Section [3](https://arxiv.org/html/2605.15488#S3)\. We try the *event\-only*, *both*, and *random* strategies\.
Variable train ratio specifies whether the ratio between context/training samples and query/inference samples is varied during synthetic pretraining\. A value of ✓ means that this ratio is randomized across synthetic tasks, exposing the model to different amounts of context information and encouraging robustness to varying downstream train/test splits\. A value of ✗ means that the ratio is fixed at 70%/30% during training\.
Together, these choices define a full configuration space of $2\times 4\times 2\times 2\times 3\times 2=192$ possible variants \(corresponding respectively to predictive pretraining, prior family, time transformation, loss, query schedule, and variable train ratio\)\. Exhaustively training all variants would be computationally expensive, so we selectively evaluate the representative configurations in Table [5](https://arxiv.org/html/2605.15488#A7.T5)\. The model marked with † is the best validation run and is used as the default SurvivalPFN checkpoint elsewhere in the paper\.
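The size of the configuration space can be checked by enumerating the Cartesian product of the six choices\. The dictionary keys and value strings below are illustrative labels chosen for this sketch, not identifiers from the SurvivalPFN codebase\.

```python
from itertools import product

# Hypothetical labels for the six configuration axes described above.
CONFIG_SPACE = {
    "predictive_pretrain": [True, False],
    "prior": ["naive", "survival-distribution", "mixture", "kitchen-sink"],
    "time_transformation": ["lognormal2normal", "time2quantile"],
    "loss": ["NLL", "CE"],
    "query_schedule": ["event-only", "both", "random"],
    "variable_train_ratio": [True, False],
}


def enumerate_configs(space):
    """Return every combination of configuration choices as a list of dicts."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


all_configs = enumerate_configs(CONFIG_SPACE)

# The default checkpoint (v01 in Table 5) expressed in the same hypothetical labels.
v01 = {
    "predictive_pretrain": True,
    "prior": "survival-distribution",
    "time_transformation": "lognormal2normal",
    "loss": "NLL",
    "query_schedule": "both",
    "variable_train_ratio": False,
}
print(len(all_configs), v01 in all_configs)
```

The 14 rows of Table 5 therefore cover only a small fraction of the full grid\.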
Figure 19: Ablation study over SurvivalPFN training configurations\. Each row corresponds to one pretrained SurvivalPFN variant, with v01 marked as the selected best validation run and used as the default checkpoint elsewhere in the paper\. Plotting conventions follow Figure [6](https://arxiv.org/html/2605.15488#S4.F6)\.
#### Ablation results\.
Figure [19](https://arxiv.org/html/2605.15488#A7.F19) summarizes the rank of each SurvivalPFN configuration\. The selected checkpoint, v01, achieves the strongest overall behavior: it attains the best median rank across metrics among the evaluated configurations, and is particularly strong on distributional metrics, ranking best on IBS and D\-calibration\. This configuration uses predictive pretraining, the survival\-distribution prior, the lognormal2normal transformation, the one\-hot NLL objective, the Both query schedule, and a fixed train/query ratio\. Its strong IBS and D\-calibration performance suggests that this combination is especially effective for learning calibrated posterior predictive survival distributions, which is the primary target of SurvivalPFN\.
Several trends emerge from the ablation\. First, the survival\-distribution prior is consistently stronger than the naive prior under otherwise similar settings\. This suggests that directly modeling flexible positive\-time distributions provides a more useful synthetic pretraining signal than treating generic tabular outputs as raw survival times\. The kitchen\-sink prior performs competitively but does not clearly dominate the survival\-distribution prior\. This may indicate that simply increasing prior diversity is not sufficient; the match between the prior family and the survival\-prediction target also matters\.
Second, the lognormal2normal transformation is preferred in this set of experiments\. The clearest comparison is between v01 and v02: replacing lognormal2normal with time2quantile substantially worsens the overall rank and degrades all five metric\-specific ranks\. This pattern suggests that the smooth positive\-time coordinate and tail extrapolation provided by lognormal2normal are useful for transferring across heterogeneous real\-world time scales, whereas the empirical quantile transform may lose information about tail behavior\.
Finally, predictive pretraining is generally helpful but not uniformly decisive in this limited ablation set\. Similarly, varying the train/query ratio does not show a clear monotonic benefit in the evaluated subset\.
## Appendix H Concurrent Works on Tabular Foundation Models for Survival Analysis
We discuss two concurrent approaches that adapt TFMs to right\-censored survival prediction\.
#### Classification\-Based Framework with Off\-the\-Shelf TFMs\.
Kim et al\. \[[46](https://arxiv.org/html/2605.15488#bib.bib130)\] propose a conversion from survival analysis to binary classification, allowing existing TFMs to be used without survival\-specific pretraining\. Given predefined discretization points

$$0 = t_0 < t_1 < \cdots < t_{K-1},$$

they define time\-indexed binary labels

$$Y_{i,k} = \mathbb{1}(T_i \leq t_k),$$

so that each original tuple $(x_i, t_i, \delta_i)$ is expanded into multiple classification examples indexed by $k$\. Under right censoring, labels after the censoring time are treated as missing; equivalently, their binary cross\-entropy objective is evaluated only when $t_k < C_i$\. Under conditional independent censoring and positivity, they show that minimizing the population binary cross\-entropy loss recovers the true failure probabilities,

$$p(x, t_k) = \Pr(T \leq t_k \mid X = x),$$

and hence the survival probabilities $S(t_k \mid x) = 1 - p(x, t_k)$\. This formulation is attractive because it can immediately use strong off\-the\-shelf TFMs such as MITRA, TabPFN v2\.5, and TabICL v2 \[[106](https://arxiv.org/html/2605.15488#bib.bib159),[27](https://arxiv.org/html/2605.15488#bib.bib158),[84](https://arxiv.org/html/2605.15488#bib.bib154)\], without retraining a survival\-specific model\.
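The expansion can be sketched as follows\. This is an illustrative reading of the reduction, not code from Kim et al\.: the function name and record layout are hypothetical, and the sketch assumes that a censoring time $C_i$ is available for every subject \(as in synthetic data or under administrative censoring\), so that the mask $t_k < C_i$ can be applied literally\.

```python
def expand_to_binary(records, grid):
    """Expand survival records into time-indexed binary classification rows.

    records: list of (x, t, delta, c) with observed time t, event indicator
             delta (1 = event, 0 = censored), and censoring time c.
             Assumption: c is known for every subject; for censored subjects c == t.
    grid:    increasing discretization points t_1 < ... < t_{K-1} (t_0 = 0 excluded).
    Returns a list of (x, t_k, y) rows, keeping only labels with t_k < c.
    """
    rows = []
    for x, t, delta, c in records:
        for t_k in grid:
            if t_k < c:
                # Y_{i,k} = 1(T_i <= t_k); a censored subject has T_i > c > t_k, so y = 0.
                y = 1 if (delta == 1 and t <= t_k) else 0
                rows.append((x, t_k, y))
            # Otherwise the label is treated as missing and no row is emitted.
    return rows


# Hypothetical toy data: one event at 1.5 (censoring time 5.0), one subject censored at 2.5.
grid = [1.0, 2.0, 3.0]
records = [("a", 1.5, 1, 5.0), ("b", 2.5, 0, 2.5)]
rows = expand_to_binary(records, grid)
```

Each subject contributes at most one row per grid point, so the expanded table has at most $N(K-1)$ rows\.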
The main limitation is that this reduction increases the effective context size\. Each subject produces up to $K-1$ time\-indexed classification examples, so a dataset with $N$ subjects becomes an expanded context of order $N(K-1)$\. This is manageable for small datasets, but can exceed the input limits of current TFMs on medium or large survival datasets, requiring subsampling and potentially discarding observed survival information\. The method also inherits limitations of discrete\-time classification, including dependence on the time grid and the need for post\-hoc monotonicity correction of predicted survival curves\. In our experiments, we include the static version of this approach as StaticSurvivalTFM\.
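Both limitations are easy to make concrete\. The sketch below computes the worst\-case expanded context size and applies one common monotonicity correction, a running maximum over the predicted failure probabilities\. The source does not state which correction is used in practice, so the running maximum is an assumption, as are the function names\.

```python
from itertools import accumulate


def expanded_context_size(n_subjects, n_grid_points):
    """Worst-case number of classification rows, N * (K - 1), for K points t_0..t_{K-1}."""
    return n_subjects * (n_grid_points - 1)


def monotone_survival(failure_probs):
    """Turn per-time failure probabilities into a non-increasing survival curve.

    Independent per-time classifiers need not yield p(x, t_1) <= p(x, t_2) <= ...,
    so a running maximum is applied before computing S(t_k | x) = 1 - p(x, t_k).
    """
    non_decreasing = list(accumulate(failure_probs, max))
    return [1.0 - p for p in non_decreasing]


# Hypothetical numbers: 5,000 subjects on a 21-point grid, and one non-monotone prediction.
size = expanded_context_size(5000, 21)
curve = monotone_survival([0.1, 0.3, 0.2, 0.5])
```

In this example the dip from 0\.3 to 0\.2 in the failure probabilities is flattened, so the corrected survival curve never increases\.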
#### Survival\-Specific Prior\-Fitted In\-Context Learning\.
Seletkov et al\. \[[91](https://arxiv.org/html/2605.15488#bib.bib157)\] propose Survival In\-Context \(SIC\), a survival\-specific prior\-fitted model trained on synthetic right\-censored datasets\. Their data generator first samples covariates and latent risk variables $(\eta_1, \eta_2)$ from structural causal models \(SCMs\)\. Event times are then generated using the extended\-hazard model

$$h(t \mid x) = h_0(t e^{\eta_1})\, e^{\eta_2},$$

which yields

$$T = e^{-\eta_1} H_0^{-1}\!\left(e^{\eta_1 - \eta_2}(-\log U)\right), \qquad U \sim \mathrm{Unif}(0,1),$$

where $H_0^{-1}$ is the inverse of the baseline cumulative hazard $H_0$, which is chosen from a set of parametric baseline families such as Weibull, lognormal, log\-logistic, Gompertz, and Birnbaum\-Saunders distributions\. Censoring is generated under a random\-censoring assumption, that is, independently of covariates and event times\.
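The sampling formula follows from inverse\-transform sampling: integrating the hazard gives $H(t \mid x) = e^{\eta_2 - \eta_1} H_0(t e^{\eta_1})$, and setting $H(T \mid x) = -\log U$ and solving for $T$ recovers the expression above\. A minimal sketch with a Weibull baseline, $H_0(t) = (t/\lambda)^{k}$, is given below\. The parameter values and function names are hypothetical, and the SCM that produces $(\eta_1, \eta_2)$ is not modeled\.

```python
import math
import random


def weibull_cum_hazard(t, shape, scale):
    """Baseline cumulative hazard H_0(t) = (t / scale) ** shape."""
    return (t / scale) ** shape


def weibull_inv_cum_hazard(u, shape, scale):
    """Inverse of the baseline cumulative hazard: H_0^{-1}(u) = scale * u ** (1 / shape)."""
    return scale * u ** (1.0 / shape)


def sample_event_time(eta1, eta2, u, shape=1.5, scale=2.0):
    """T = exp(-eta1) * H_0^{-1}(exp(eta1 - eta2) * (-log u)) for u in (0, 1]."""
    exp_draw = -math.log(u)
    return math.exp(-eta1) * weibull_inv_cum_hazard(
        math.exp(eta1 - eta2) * exp_draw, shape, scale
    )


def cumulative_hazard(t, eta1, eta2, shape=1.5, scale=2.0):
    """H(t | x) = exp(eta2 - eta1) * H_0(t * exp(eta1)) under the extended-hazard model."""
    return math.exp(eta2 - eta1) * weibull_cum_hazard(t * math.exp(eta1), shape, scale)


# Hypothetical latent risks; 1 - random() lies in (0, 1], which avoids log(0).
rng = random.Random(0)
u = 1.0 - rng.random()
T = sample_event_time(eta1=0.4, eta2=-0.2, u=u)
```

Evaluating `cumulative_hazard(T, 0.4, -0.2)` returns `-log(u)` up to floating\-point error, which confirms that the sampler inverts the cumulative hazard\.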
Architecturally, SIC builds on TabICL, adds a time\-event embedding, and uses a DeepHit\-style discrete\-time survival head \(Lee et al\. \[[55](https://arxiv.org/html/2605.15488#bib.bib30)\]\) trained with a likelihood\-plus\-ranking loss\.
SIC is closely related to SurvivalPFN in that both methods pretrain an in\-context model specifically for survival prediction\. However, the two approaches differ substantially in prior design, which is central to the PFN paradigm\. SIC’s prior is based on a parametric extended\-hazard construction and only random censoring, whereas SurvivalPFN uses a broader family of identifiable right\-censored DGPs, including random censoring and covariate\-dependent censoring mechanisms satisfying conditional independence\. SurvivalPFN also avoids committing to explicit parametric hazard or survival families, allowing the event and censoring distributions to be generated by more flexible stochastic neural mechanisms\. Finally, SurvivalPFN is accompanied by a posterior\-predictive consistency guarantee for identifiable survival priors, whereas SIC only provides empirical evidence\.
The empirical scope also differs: SIC evaluates on a smaller benchmark with one main metric and a limited set of baselines, while our study evaluates on 61 held\-out datasets, five metrics, and 21 baselines\. Since SIC has not released public model weights or code, we cannot include it in our direct empirical comparison\.