Aligning Data-Driven Predictors with Allocation: A Decision-Focused Approach to Survival Analysis

arXiv cs.LG Papers

Summary

This paper introduces a decision-focused learning approach for survival analysis that aligns predictive models with downstream allocation decisions by optimizing NDCG. Applied to US heart transplant data, it improves the NDCG of baseline models by 50-100%, potentially yielding tens of thousands of additional life-years annually.

arXiv:2606.02671v1 Announce Type: new Abstract: Machine learning predictors have become essential tools for guiding automated decision making. However, a major misalignment persists: predictive models are typically optimized in terms of standard statistical metrics in isolation from the algorithmic tasks they inform. We highlight this incongruity in the high-stakes domain of organ allocation by demonstrating that any algorithm relying on (even highly accurate) survival predictors optimized for standard metrics -- such as the Concordance index (C-index) -- can yield arbitrarily poor outcomes when used for allocation, failing to guarantee utility better than a uniform random selection. To bridge the gap between survival analysis and policy optimization, we introduce a decision-focused learning approach based on optimizing normalized discounted cumulative gain (NDCG), a mainstay metric in information retrieval. We establish the utility of NDCG in survival analysis by proving that it translates to guarantees on the performance of allocation. Empirically, we propose a bootstrapping approach to optimize the NDCG of existing survival models. Unlike prior work, we also address the challenge of right censorship when evaluating ranking. On historical heart transplant data from the US, our method dramatically boosts the NDCG of baseline models by 50-100%, which translates to tens of thousands of additional life years gained annually when deployed for transplant allocation. We anticipate that our framework will find broader applications in decision making with predictions.

Cached at: 06/03/26, 09:39 AM

# A Decision-Focused Approach to Survival Analysis
Source: [https://arxiv.org/html/2606.02671](https://arxiv.org/html/2606.02671)
## Aligning Data\-Driven Predictors with Allocation: A Decision\-Focused Approach to Survival Analysis

Itai Zilberstein (correspondence to izilbers@cs.cmu.edu), Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA

Tuomas Sandholm, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA. Additional affiliations: Strategy Robot, Inc., Strategic Machine, Inc., Optimized Markets, Inc.

###### Abstract

Machine learning predictors have become essential tools for guiding automated decision making. However, a major misalignment persists: predictive models are typically optimized in terms of standard statistical metrics in isolation from the algorithmic tasks they inform. We highlight this incongruity in the high-stakes domain of organ allocation by demonstrating that any algorithm relying on (even highly accurate) survival predictors optimized for standard metrics, such as the Concordance index (C-index), can yield arbitrarily poor outcomes when used for allocation, failing to guarantee utility better than a uniform random selection. To bridge the gap between survival analysis and policy optimization, we introduce a *decision-focused learning* approach based on optimizing *normalized discounted cumulative gain (NDCG)*, a mainstay metric in information retrieval. We establish the utility of NDCG in survival analysis by proving that it translates to guarantees on the performance of allocation. Empirically, we propose a bootstrapping approach to optimize the NDCG of existing survival models. Unlike prior work, we also address the challenge of right censorship when evaluating ranking. On historical heart transplant data from the US, our method dramatically boosts the NDCG of baseline models by 50-100%, which translates to tens of thousands of additional life years gained annually when deployed for transplant allocation. We anticipate that our framework will find broader applications in decision making with predictions.

## 1 Introduction

Real-world decision making increasingly relies on algorithms powered by machine learning (ML) predictors trained on vast amounts of historical data. From resource allocation to automated planning and scheduling, these data-driven systems are deployed in high-stakes environments. However, an underlying disconnect persists: the development of classical algorithms for these problems is often disjoint from the design of the predictive models they leverage. ML models are typically optimized in isolation for standard statistical metrics, while the downstream algorithms using these predictions either fail to account for the predictor’s performance profile or suffer because the algorithmic objective is misaligned with the model’s training. The gap between predictive accuracy and algorithmic utility can lead to catastrophic outcomes, particularly in high-stakes applications such as organ allocation.

Organ transplantation is the treatment of choice for many terminal illnesses. Across organ types, the demand for deceased-donor organs outpaces the available supply \[Cameli et al., [2022](https://arxiv.org/html/2606.02671#bib.bib53)\]. In the US alone, thousands of patients with end-stage heart failure are waitlisted for a life-saving organ.

The current US heart transplant allocation policy places patients into rigid hierarchical tiers and allocates the organ to the highest-priority compatible patient. The policy often treats patients with heterogeneous clinical profiles as effectively identical. A major criticism of the policy is that it does not leverage finer-grained predictions of pretransplant mortality and post-transplant outcomes \[Shore et al., [2020](https://arxiv.org/html/2606.02671#bib.bib41), Zhang et al., [2024](https://arxiv.org/html/2606.02671#bib.bib67)\]. As a result, the US is transitioning to new data-driven solutions to improve the efficiency of the heart transplantation system \[Papalexopoulos et al., [2024](https://arxiv.org/html/2606.02671#bib.bib54)\]. The allocations of other organs, such as lungs \[OPTN, [2025](https://arxiv.org/html/2606.02671#bib.bib39), Gottlieb et al., [2017](https://arxiv.org/html/2606.02671#bib.bib111)\], livers \[Kamath et al., [2001](https://arxiv.org/html/2606.02671#bib.bib109), Allen et al., [2024](https://arxiv.org/html/2606.02671#bib.bib66)\], and kidneys \[Abraham et al., [2007](https://arxiv.org/html/2606.02671#bib.bib91), Mayer and Persijn, [2006](https://arxiv.org/html/2606.02671#bib.bib110)\], already rely on such computational methods in the US and abroad.

A common data-driven approach to allocating organs relies on predictors of transplant outcomes, such as the expected life-years gained from an operation \[Berrevoets et al., [2021](https://arxiv.org/html/2606.02671#bib.bib15), [2020](https://arxiv.org/html/2606.02671#bib.bib1), Zilberstein et al., [2026b](https://arxiv.org/html/2606.02671#bib.bib99), Zhang et al., [2024](https://arxiv.org/html/2606.02671#bib.bib67)\]. The field of survival analysis has developed powerful statistical models for estimating such outcomes \[Cox, [1972](https://arxiv.org/html/2606.02671#bib.bib23), Katzman et al., [2018](https://arxiv.org/html/2606.02671#bib.bib102), Lee et al., [2018](https://arxiv.org/html/2606.02671#bib.bib28), Wei, [1992](https://arxiv.org/html/2606.02671#bib.bib101), Nagpal et al., [2021](https://arxiv.org/html/2606.02671#bib.bib104)\]. Yet, when these models are integrated into allocation mechanisms, the aforementioned disconnect surfaces. Survival models are traditionally optimized for and evaluated on metrics such as the Concordance index (C-index) or average error, which measure aggregate performance across an entire dataset. However, when a donor heart arrives, the matching algorithm does not need perfect point estimates of survival for all patients; rather, it requires a guarantee that it can identify the single best available match.

#### Our contributions

As we will demonstrate, matching with a predictor that is optimized for C-index can have arbitrarily bad outcomes. We show that any deterministic algorithm relying on a predictor with near-perfect C-index can obtain a near-zero fraction of the optimal utility ([Proposition 1](https://arxiv.org/html/2606.02671#Thmproposition1)). We then prove that no algorithm relying on a predictor with near-perfect C-index can guarantee more utility than random selection ([Proposition 2](https://arxiv.org/html/2606.02671#Thmproposition2)), showing that C-index is a non-informative measure for allocating even a single donor. This failure is not restricted to the C-index: most aggregate metrics, such as average error, can also lead to arbitrarily bad outcomes.

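| Waitlisted patient | Ground truth | Prediction (green) | Prediction (red) |
| --- | --- | --- | --- |
| P1 | 1 yr | 1 yr | 11 yrs |
| P2 | 2 yrs | 3 yrs | 2 yrs |
| ⋮ | ⋮ | ⋮ | ⋮ |
| P9 | 9 yrs | 10 yrs | 9 yrs |
| P10 | 10 yrs | 8 yrs | 10 yrs |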

Figure 1: Illustration of heart transplant allocation with predicted outcomes. The leftmost value shows the unknown, ground-truth patient survival, the middle green value shows the predictions with $\text{NDCG}@1 = 0.9$, and the rightmost red value shows the predictions with C-index $= 0.8$.

We take a step towards bridging the gap between predictive modeling in survival analysis and the requirements of downstream allocation policies. While we focus on matching with predicted edge-weights and predictors for survival analysis, our methods provide a template for evaluating and optimizing ML models whose primary purpose is to inform discrete allocation decisions.

We begin by establishing a formal link between a predictor’s $\text{NDCG}@k$ and the utility guarantee of downstream allocation ([Theorem 1](https://arxiv.org/html/2606.02671#Thmtheorem1)). We prove that the $\text{NDCG}@1$ of a predictor translates to a provable guarantee on the utility of greedy allocation policies ([Corollary 1](https://arxiv.org/html/2606.02671#Thmcorollary1)), a property not shared by the C-index. [Figure 1](https://arxiv.org/html/2606.02671#S1.F1) illustrates this discrepancy.

We then introduce the use of normalized discounted cumulative gain (NDCG) \[Järvelin and Kekäläinen, [2002](https://arxiv.org/html/2606.02671#bib.bib106), Wang et al., [2013](https://arxiv.org/html/2606.02671#bib.bib105)\] for survival analysis. NDCG cannot be directly applied to survival analysis due to right censorship: many data points are only represented by a lower bound on their true survival times because the patient is still alive or their follow-up has ended. We propose two novel estimators of NDCG for right censored data, and prove that both provide unbiased estimates of the true discounted cumulative gain (DCG). We show how such estimators can be used to select the model with superior NDCG.

Finally, we propose a method to bootstrap current survival predictors to optimize a model for NDCG. We show using real historical heart transplant data that our estimators of NDCG can accurately identify model superiority and our bootstrapping method significantly improves the NDCG, roughly doubling the $\text{NDCG}@1$ from baseline models. These gains are monumental: applying the increase translates to nearly 50,000 additional life years annually in the US alone. (Footnote 1: Assuming 4,000 annual transplants with a median graft survival of 12 years \[Colvin et al., [2025](https://arxiv.org/html/2606.02671#bib.bib112)\].)

Our work exposes failures in the current design of high\-stakes decision\-making systems for organ allocation, and we offer theoretically grounded solutions to these failures\. We show it is unsafe to assume that better statistical prediction yields better policy outcomes, unveiling that current mechanisms, including those for allocating lungs, livers, and kidneys, are misaligned with their life\-saving objectives and cannot guarantee outcomes better than random selection\. Beyond transplantation, for ML to be safely deployed, its predictive components must be aligned to the downstream actions they inform, and our methods support this\.

The mismatch between prediction and optimization is studied more broadly in the literature, and our work is the first to connect survival analysis with decision-focused and end-to-end learning \[Donti et al., [2017](https://arxiv.org/html/2606.02671#bib.bib57), Wilder et al., [2019](https://arxiv.org/html/2606.02671#bib.bib65), Elmachtoub and Grigas, [2022](https://arxiv.org/html/2606.02671#bib.bib64), Mandi et al., [2024](https://arxiv.org/html/2606.02671#bib.bib63), Capitaine et al., [2025](https://arxiv.org/html/2606.02671#bib.bib62)\]. These lines of work focus on aligning ML models with the decision-making tasks they inform. In healthcare, this mismatch has also been recognized in the context of causal treatment effects \[Vanderschueren et al., [2024](https://arxiv.org/html/2606.02671#bib.bib119), Kamran et al., [2024](https://arxiv.org/html/2606.02671#bib.bib43), Frauen et al., [2025](https://arxiv.org/html/2606.02671#bib.bib117), Fernández-Loría and Provost, [2022](https://arxiv.org/html/2606.02671#bib.bib118), Arno et al., [2026](https://arxiv.org/html/2606.02671#bib.bib116)\]. Our paper focuses on survival analysis, which has the unique challenge of right censorship, and on predicting the top-ranked candidate. We are the first to employ such techniques for organ allocation. We provide further discussion of related work in [Appendix A](https://arxiv.org/html/2606.02671#A1).

## 2 Preliminaries

We begin by reviewing standard predictive measures from information retrieval and survival analysis\.

#### DCG and NDCG

Many information retrieval settings are concerned with providing an accurate ranking of a set of datapoints (e.g., recommendations). Given $N$ inputs, we typically care about the $k$ highest-ranked predictions rather than the entire population. Let $T_i$ denote the relevance (e.g., utility) of the point ranked $i$th by a prediction model, where a lower rank means higher utility. In standard settings, the ground-truth relevance is known. The discounted cumulative gain at $k$ ($\text{DCG}@k$) evaluates the quality of the top $k$ ranked items,

$$\text{DCG}@k = \sum_{i=1}^{k} \frac{T_i}{\log_2(i+1)}.$$

To normalize DCG, we compare it to the ideal $\text{DCG}@k$ ($\text{IDCG}@k$), which is the maximum $\text{DCG}@k$ achievable if the ranking were perfectly ordered by the true relevance. The normalized discounted cumulative gain ($\text{NDCG}@k$) is $\text{DCG}@k / \text{IDCG}@k$. So, $\text{NDCG}@k = 1$ represents a perfect ordering of the top $k$ points. For further background on information retrieval, we refer to Burges \[[2010](https://arxiv.org/html/2606.02671#bib.bib42)\] and Schütze et al. \[[2008](https://arxiv.org/html/2606.02671#bib.bib9)\].
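The definitions above can be made concrete with a minimal sketch. It assumes the linear gain $T_i$ used in this section (not the exponential gain common in some retrieval libraries) and that a higher predicted score means a better rank; the function names `dcg_at_k` and `ndcg_at_k` are illustrative and not taken from the paper.

```python
import math

def dcg_at_k(relevance_in_predicted_order, k):
    """DCG@k with linear gain: sum over ranks i = 1..k of T_i / log2(i + 1)."""
    top = relevance_in_predicted_order[:k]
    return sum(t / math.log2(i + 1) for i, t in enumerate(top, start=1))

def ndcg_at_k(true_relevance, predicted_scores, k):
    """NDCG@k = DCG@k / IDCG@k, ranking items by descending predicted score."""
    order = sorted(range(len(true_relevance)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_relevance[i] for i in order]
    ideal = sorted(true_relevance, reverse=True)  # perfect ordering
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked, k) / idcg if idcg > 0 else 0.0

# The item with true relevance 1 is ranked first, so NDCG@1 = 1/3.
example = ndcg_at_k([3, 1, 2], [0.2, 0.9, 0.5], k=1)
```

With $k=1$ the metric reduces to the true relevance of the top-ranked item divided by the best available relevance, which is the quantity a greedy allocation cares about.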

#### Survival analysis

In the typical setting for survival analysis, we are given a dataset of individuals $i \in \{1, \dots, N\}$. Let $T^{*}_{i}$ denote the true unobserved survival time of person $i$ and $C_i$ the censoring time. We observe the random variable $T_i = \min\{T^{*}_{i}, C_i\}$ and the event indicator $\delta_i = \mathbb{I}\{T^{*}_{i} \leq C_i\}$, where $\mathbb{I}$ is the binary indicator function. Let $X_i \in \mathbb{R}^{d}$ be the baseline covariate vector for patient $i$. Instead of predicting an unbounded survival time, we can also shift the target to predict survival within a fixed horizon $\tau$, where $T^{(\tau),*}_{i} = \min\{T^{*}_{i}, \tau\}$ and $\delta_i^{(\tau)} = \max\left\{\delta_i, \mathbb{I}\{T_i \geq \tau\}\right\}$. We define $S(t \mid X) = \mathbb{P}(T^{*} > t \mid X)$ as the true conditional survival function and $G(t \mid X) = \mathbb{P}(C > t \mid X)$ as the true conditional censoring survival function. We assume that the covariates $X$ capture both the survival and censoring mechanisms and that the two are independent conditioned on $X$.

###### Assumption 1 (Conditionally independent censoring).

$(T^{*} \perp\!\!\!\perp C) \mid X$.

For heart transplantation, we aim to predict how long a new organ will sustain a patient following an operation\. A ubiquitous challenge with healthcare datasets is right censorship due to patients stopping reporting\. We know the last time a patient reported their condition, not the true event time\.

To adapt NDCG for survival data, the relevance score becomes the true survival time (or the restricted true survival time). However, for censored patients, the true relevance remains unobservable. Therefore, we require new estimators to compute NDCG for right censored datasets.

#### Concordance index

The standard metric for evaluating predicted rankings in survival analysis is the concordance index (C-index). The C-index measures pairwise accuracy and is defined as the ratio of concordant pairs among all *comparable* pairs of patients. There are different ways of computing the C-index for right censored datasets \[Gönen and Heller, [2005](https://arxiv.org/html/2606.02671#bib.bib11), Uno et al., [2011](https://arxiv.org/html/2606.02671#bib.bib12)\], and we adopt the commonly used Harrell’s C-index \[Harrell et al., [1982](https://arxiv.org/html/2606.02671#bib.bib13)\]. These variations do not change the underlying principle of the metric. A pair of patients $(i, j)$ is comparable if we definitively know which patient experienced the event of interest (death, graft failure, etc.) first. A pair is comparable only if the patient with the shorter observed time experienced the event (i.e., $T_i < T_j$ and $\delta_i = 1$).
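A minimal sketch of the comparability rule follows. It assumes that `predicted` holds predicted survival times (larger means longer survival) and that prediction ties count as half a concordant pair, which is a common convention but is not stated in the text above; it also ignores refinements for tied observed times. The function name `harrell_c_index` is illustrative.

```python
def harrell_c_index(times, events, predicted):
    """Fraction of comparable pairs that the predictions order correctly.

    A pair (i, j) is comparable when T_i < T_j and delta_i = 1, i.e. the
    patient with the shorter observed time experienced the event.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if predicted[i] < predicted[j]:
                    concordant += 1.0   # correctly predicts that j outlives i
                elif predicted[i] == predicted[j]:
                    concordant += 0.5   # assumed tie convention
    return concordant / comparable if comparable else float("nan")
```

Because every comparable pair carries the same weight, a mistake involving the single best candidate costs no more than a mistake between two mediocre candidates.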

## 3 Aligning machine learning predictors with policy optimization

In this section, we analyze how ML predictors interact with the allocation policies they inform\. We begin by presenting a motivating example that highlights the limitations of allocating based on a predictor optimized for C\-index \(or other aggregate metrics\)\. We assume that all utility is non\-negative: following a transplant, the time a patient survives cannot be less than zero\.

### 3.1 A motivating example

Consider a fully-observed dataset (i.e., $\delta_i = 1 \;\forall i$) consisting of $N$ transplant candidates and a single donor to be allocated. The goal is to allocate the donor to the patient with the best outcome (i.e., to maximize utility). Suppose the utility of allocating to patient $i$ is $T_i^{*} = i$. An allocation algorithm does not know the true utility, but rather relies on a predictive model.

Now suppose a model correctly ranks the utility for patients $i \in [2, N]$, but incorrectly predicts that patient $1$, who has the worst outcome, has the best one. Only the $N-1$ pairs involving patient $1$ are discordant, out of $N(N-1)/2$ pairs in total, so when we evaluate the concordance of this model, we obtain a C-index of $1 - \frac{2}{N}$. As $N$ grows, the C-index quickly approaches $1$ despite the model recommending the patient with the worst outcome.

For $N = 100$, the above example results in a C-index of 0.98. This value is much higher than that of the state-of-the-art models for predicting graft survival following a heart transplant, which are around $0.6$ \[Aleksova et al., [2020](https://arxiv.org/html/2606.02671#bib.bib52), Lee et al., [2018](https://arxiv.org/html/2606.02671#bib.bib28), Anagnostides et al., [2025](https://arxiv.org/html/2606.02671#bib.bib32), Ayers et al., [2021](https://arxiv.org/html/2606.02671#bib.bib30)\]. However, it is evident that despite a high C-index, a model can be arbitrarily bad at predicting the top candidate. A greedy algorithm allocating transplants based on this predictor would make catastrophic decisions. The C-index is not the right measure of a predictor that is being leveraged by a decision-making algorithm. We can formalize the failure of matching with concordance for an allocation algorithm that selects only a single candidate using a predictor.

###### Proposition 1\.

For any deterministic algorithm selecting a single candidate based on a predictor $\hat{f}$, and for any $c \in (0,1)$ and $\rho \in (0,1)$, there exists a set of $N$ candidates and a predictor $\hat{f}$ with C-index at least $c$, such that the algorithm achieves at most a $\rho$ fraction of the optimal utility.

Furthermore, not only do deterministic algorithms fail, but *no algorithm* relying on concordance, even a randomized one, can hope to do better than uniform random guessing.

###### Proposition 2\.

For any algorithm selecting a single candidate based on a predictor $\hat{f}$ and for any $c \in (0,1)$, there exists a set of $N$ candidates and a predictor $\hat{f}$ with C-index at least $c$, such that the algorithm cannot guarantee more expected utility than that of a uniform random selection.

This failure is not unique to the C-index. Many aggregate metrics, including standard average errors, also fail to provide utility guarantees. However, if we compute the $\text{NDCG}@1$ of the above predictor with $N = 100$, we would indeed see the catastrophic performance: $\text{NDCG}@1 = 0.01$.
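The numbers in the motivating example can be checked directly. The sketch below builds the $N = 100$ instance with $T_i^{*} = i$ and a predictor that is correct except for placing patient 1 on top; the specific predicted value `N + 1` is an arbitrary choice made here to force that ordering.

```python
N = 100
true_utility = list(range(1, N + 1))        # T*_i = i
predicted = [float(t) for t in true_utility]
predicted[0] = N + 1.0                      # patient 1 wrongly predicted as best

# C-index on fully observed data: fraction of pairs ordered correctly.
pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]
concordant = sum(
    (true_utility[i] < true_utility[j]) == (predicted[i] < predicted[j])
    for i, j in pairs
)
c_index = concordant / len(pairs)

# NDCG@1: utility of the top-predicted patient over the best achievable utility.
top = max(range(N), key=lambda i: predicted[i])
ndcg_at_1 = true_utility[top] / max(true_utility)

print(round(c_index, 4), round(ndcg_at_1, 4))  # prints 0.98 0.01
```

The same predictor therefore looks near-perfect under the C-index and near-worthless under $\text{NDCG}@1$, which is the gap the propositions formalize.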

### 3.2 Allocating with NDCG

The failure of C-index arises from its inability to identify high-utility candidates. NDCG, on the other hand, captures exactly this quantity, and we can indeed bound the worst-case utility an allocation algorithm can achieve against the optimal allocation, $U(\mathrm{OPT})$. Specifically, given a predictor $\hat{f}$ with $\text{NDCG}@k$ at least $\alpha$, we can randomly select one of the top $k$ ranked candidates with probability proportional to the logarithmic discount in the $\text{DCG}@k$ definition. We refer to this algorithm as the position-weighted allocation algorithm ($\mathsf{PWA}$), which we present in [Algorithm 1](https://arxiv.org/html/2606.02671#alg1).

###### Theorem 1\.

Given $N$ candidates and a predictor $\hat{f}$ with $\text{NDCG}@k$ at least $\alpha$, algorithm $\mathsf{PWA}$ ([Algorithm 1](https://arxiv.org/html/2606.02671#alg1)) selecting a single candidate based on $\hat{f}$ achieves expected utility at least $\frac{\alpha}{W_k} \cdot U(\mathrm{OPT})$, where $W_k = \sum_{i=1}^{k} \frac{1}{\log_2(i+1)}$.

The bound we obtain is a function of both $\alpha$ and $k$, and degrades approximately linearly in $k$. For $k = 5$, $\mathsf{PWA}$ guarantees roughly $\alpha/3$ of the optimal utility. $\text{NDCG}@k$ can have higher variance at lower values of $k$, so while optimizing for lower values of $k$ is theoretically better, it may be the case that there are practical tradeoffs for robustness. When evaluating the predictor at $k = 1$, the randomized policy reduces to the greedy algorithm, yielding a direct result for greedy allocation.

###### Corollary 1\.

Given $N$ candidates and a predictor $\hat{f}$ with $\text{NDCG}@1$ at least $\alpha$, algorithm $\mathsf{PWA}$ selecting a single candidate based on $\hat{f}$ achieves utility at least $\alpha \cdot U(\mathrm{OPT})$.

These results stand in stark contrast to the performance of aggregate metrics which fail to guarantee any meaningful utility for single\-item allocation\. To align a data\-driven predictor with the downstream allocation, these results prove that NDCG is a theoretically sound metric to optimize for\.
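The sampling rule described for $\mathsf{PWA}$ and the constant $W_k$ from Theorem 1 can be sketched as follows. This is an illustration built from the verbal description in this section, not a transcription of Algorithm 1; the function names are hypothetical, and higher predicted scores are assumed to mean better rank.

```python
import math
import random

def discount_weights(k):
    """Weights 1 / log2(i + 1) for ranks i = 1..k; their sum is W_k."""
    return [1.0 / math.log2(i + 1) for i in range(1, k + 1)]

def pwa_select(predicted_scores, k, rng=random):
    """Sample one of the top-k predicted candidates with probability
    proportional to the DCG discount of its rank."""
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    top_k = order[:k]
    return rng.choices(top_k, weights=discount_weights(len(top_k)), k=1)[0]

def pwa_expected_utility(true_utility, predicted_scores, k):
    """Exact expected utility of the sampling rule, equal to DCG@k / W_k."""
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    weights = discount_weights(min(k, len(order)))
    dcg = sum(w * true_utility[i] for w, i in zip(weights, order))
    return dcg / sum(weights)

W5 = sum(discount_weights(5))  # about 2.95, hence the roughly alpha / 3 guarantee
```

The expected utility equals $\text{DCG}@k / W_k$, and since $\text{IDCG}@k$ is at least the best single utility, an $\text{NDCG}@k$ of $\alpha$ yields at least $\alpha / W_k$ of the optimum. For $k = 1$ the rule always picks the top-ranked candidate.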

While [Theorem 1](https://arxiv.org/html/2606.02671#Thmtheorem1) establishes a bound for a single allocation, real-world scenarios often require sequential matching. Over multiple donor arrivals, we can apply the techniques we develop in the remainder of this paper to individual predictors for different representative donor types, consistently optimizing for the $\text{NDCG}@k$.

## 4 NDCG for censored datasets

We have shown that optimizing for NDCG is a better target for predictors used by a greedy allocation algorithm than aggregate metrics like C-index. However, since the true relevance score $T^{*}_{i}$ is unobservable for right censored data points, we cannot directly compute the standard DCG metrics in the survival analysis context. In this section, we propose two estimators of DCG that account for censoring. We show that both are unbiased whether we use the true survival time or the restricted survival time as the relevance score. It follows from the linearity of expectation that the DCG estimate is unbiased if the relevance estimate is conditionally unbiased given $X_i$. We then discuss how unbiased estimates of DCG translate to estimates of NDCG and to evaluating survival models for allocation.

### 4.1 Unbiased estimates of relevance

The first method replaces the unobserved survival time for censored patients with its conditional expectation given a survival function $\hat{S}$. We define the expected years (EY) relevance estimator as

$$\hat{T}_i^{\text{EY}} = \delta_i T_i + (1 - \delta_i)\, \mathbb{E}_{\hat{S}}\left[T^{*}_{i} \mid T^{*}_{i} > T_i, X_i\right].$$

To prove that the EY estimator is unbiased, we need to assume that $\hat{S}$ is unbiased.

###### Assumption 2 (Unbiasedness of $\hat{S}$).

The conditional survival function $\hat{S}$ is conditionally unbiased, such that the expected value of the estimated survival time matches the true survival time. That is, $\mathbb{E}\left[\mathbb{E}_{\hat{S}}[T^{*}_{i} \mid T^{*}_{i} > T_i, X_i]\right] = \mathbb{E}[T^{*}_{i} \mid T^{*}_{i} > T_i, X_i]$.

###### Property 1 (Unbiasedness of EY estimator).

Under Assumptions [1](https://arxiv.org/html/2606.02671#Thmassumption1) and [2](https://arxiv.org/html/2606.02671#Thmassumption2), $\hat{T}^{\text{EY}}_{i}$ is a conditionally unbiased estimator of $\mathbb{E}[T^{*}_{i} \mid X_i]$.

Using the same argument, we can also show that the estimator is unbiased in the restricted setting ([Property 3](https://arxiv.org/html/2606.02671#Thmproperty3)). The theoretical unbiasedness of the EY estimator relies on the assumption that the conditional survival estimator is unbiased over the censored population. While relying on nuisance estimators is common in statistics, in some ways this approach seems circular: we require a survival model to compute the DCG of another survival model. But this also highlights where the power of our bootstrapping framework stems from. If we can use a survival model as a nuisance estimator, we can also leverage it to bootstrap another model that is optimized for extremal ranking. (Footnote 2: A nuisance parameter is one that is not of primary interest but must be accounted for to analyze the target parameters.)
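One way to compute the EY relevance in practice is sketched below. It assumes that $\hat{S}(\cdot \mid X_i)$ is available as values `surv` on a discrete time grid `grid` for the patient in question, uses the identity $\mathbb{E}[T^{*} \mid T^{*} > t] = t + \int_t^{\infty} S(u)/S(t)\,du$, and truncates the integral at the last grid point. The discretization, the trapezoid rule, and the function names are assumptions made for illustration.

```python
import numpy as np

def conditional_expected_survival(t_obs, grid, surv):
    """E[T* | T* > t_obs] = t_obs + integral of S(t) / S(t_obs) from t_obs
    to the end of the grid, with S given on a discrete time grid."""
    s_obs = float(np.interp(t_obs, grid, surv))
    if s_obs <= 0:
        return float(t_obs)  # no survival mass left beyond t_obs
    mask = grid > t_obs
    ts = np.concatenate(([t_obs], grid[mask]))
    ss = np.concatenate(([s_obs], surv[mask]))
    area = np.sum(0.5 * (ss[1:] + ss[:-1]) * np.diff(ts))  # trapezoid rule
    return float(t_obs + area / s_obs)

def ey_relevance(t_obs, delta, grid, surv):
    """EY estimator: the observed time if the event occurred, otherwise the
    model-implied conditional mean for the censored patient."""
    if delta == 1:
        return float(t_obs)
    return conditional_expected_survival(t_obs, grid, surv)

# Example: survival that is uniform on [0, 10], censored at 4 years.
grid = np.linspace(0.0, 10.0, 11)
surv = 1.0 - grid / 10.0
imputed = ey_relevance(4.0, 0, grid, surv)  # conditional mean of Uniform(4, 10)
```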

An alternative to imputation is the inverse probability of censoring weighting (IPCW) \[Graf et al., [1999](https://arxiv.org/html/2606.02671#bib.bib108), Gerds and Schumacher, [2006](https://arxiv.org/html/2606.02671#bib.bib107)\]. IPCW discards censored patients from the evaluation and re-weights the observed instances to account for the discarded population, using the censoring function $\hat{G}(t \mid X)$. The censoring survival function $\hat{G}(t \mid X)$ is the probability of remaining uncensored up to time $t$ given $X$. The IPCW estimator is

$$\hat{T}^{\text{IPCW}}_{i} = \frac{\delta_i T_i}{\hat{G}(T_i \mid X_i)}.$$

In order to prove unbiasedness of the IPCW estimator, we need to assume that $\hat{G}$ is specified such that the inverse weights correctly recover the population expectation.

###### Assumption 3 (Correct specification of censoring model).

The conditional censoring model $\hat{G}$ is correctly specified such that it matches the true conditional censoring distribution $G(t \mid X) = \mathbb{P}(C_i > t \mid X_i)$. We also assume that $\hat{G}(t \mid X) > 0$ for all $t, X$.

###### Property 2 (Unbiasedness of IPCW estimator).

Under Assumptions [1](https://arxiv.org/html/2606.02671#Thmassumption1) and [3](https://arxiv.org/html/2606.02671#Thmassumption3), $\hat{T}^{\text{IPCW}}_{i}$ is a conditionally unbiased estimator of $\mathbb{E}[T_i^{*} \mid X_i]$.

As is the case for the EY estimator, the IPCW estimator is also unbiased in the restricted setting ([Property 4](https://arxiv.org/html/2606.02671#Thmproperty4)). Restricting the horizon $\tau$ bounds the maximum weight at $1/\hat{G}(\tau \mid X)$ and is a practical solution for stability since IPCW weights can have high variance. [Appendix C](https://arxiv.org/html/2606.02671#A3) discusses the estimators and the restricted setting further.
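The unrestricted IPCW relevance is a one-line computation once the censoring model has been evaluated. The sketch below assumes the caller supplies `g_hat_at_t`, the values $\hat{G}(T_i \mid X_i)$ from some fitted censoring model; how that model is fitted is outside the scope of this illustration, and the function name is hypothetical.

```python
import numpy as np

def ipcw_relevance(t_obs, delta, g_hat_at_t):
    """IPCW relevance delta_i * T_i / G_hat(T_i | X_i).

    Censored patients (delta_i = 0) get relevance zero; patients with an
    observed event are up-weighted by the inverse probability of having
    remained uncensored until their event time.
    """
    t = np.asarray(t_obs, dtype=float)
    d = np.asarray(delta, dtype=int)
    g = np.asarray(g_hat_at_t, dtype=float)
    observed = d == 1
    if np.any(g[observed] <= 0):
        raise ValueError("G_hat must be positive at observed event times")
    out = np.zeros_like(t)
    out[observed] = t[observed] / g[observed]
    return out

relevance = ipcw_relevance([2.0, 5.0, 3.0], [1, 0, 1], [0.5, 0.4, 1.0])
```

Small values of $\hat{G}$ inflate individual relevances sharply, which is the variance issue that restricting the horizon to $\tau$ is meant to control.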

### 4.2 NDCG from unbiased relevance

Given unbiased estimates of relevance, linearity of expectation ensures that $\widehat{\text{DCG}}$ and $\widehat{\text{DCG}@k}$ are unbiased. However, the IDCG requires sorting these relevances. Because the maximum operator is convex, sorting noisy estimates amplifies positive errors, leading to a positive bias in $\widehat{\text{IDCG}}$. As a result, the estimate of NDCG is generally negatively biased. However, the bias does not affect the relative comparisons of two survival models. For a given dataset, $\widehat{\text{IDCG}}$ acts as a normalizer determined by the ground-truth scores and any nuisance estimators. It is independent of the models we evaluate. Therefore, given two survival models, $A$ and $B$, if $\widehat{\text{NDCG}}_A > \widehat{\text{NDCG}}_B$, then $\widehat{\text{DCG}}_A > \widehat{\text{DCG}}_B$. The evaluation preserves the relative ranking of the models, allowing us to determine whether model $A$ is superior to model $B$. In addition, $\widehat{\text{NDCG}}$ is scale-consistent with respect to multiplication. If $\widehat{\text{NDCG}}_A = 2\,\widehat{\text{NDCG}}_B$, it implies model $A$ achieves twice the discounted gain of model $B$ on the dataset since $\widehat{\text{IDCG}}$ cancels out in the ratio.
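The cancellation argument can be illustrated on toy numbers. In the sketch below, `rel_hat` stands for estimated relevances (for instance from the EY or IPCW estimator), and `scores_a` and `scores_b` are the rankings produced by two hypothetical models; all values are made up for illustration.

```python
import math

def dcg_at_k(relevance_in_rank_order, k):
    """DCG@k with linear gain over the first k ranked relevances."""
    return sum(t / math.log2(i + 1)
               for i, t in enumerate(relevance_in_rank_order[:k], start=1))

def estimated_ndcg(rel_hat, scores, k):
    """Return (NDCG-hat, DCG-hat) for one model on a fixed dataset."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = dcg_at_k([rel_hat[i] for i in order], k)
    idcg = dcg_at_k(sorted(rel_hat, reverse=True), k)  # identical for every model
    return dcg / idcg, dcg

rel_hat = [4.0, 0.0, 7.5, 2.0, 6.0]     # illustrative estimated relevances
scores_a = [0.3, 0.1, 0.9, 0.2, 0.8]    # model A ranks the best patients first
scores_b = [0.9, 0.8, 0.1, 0.2, 0.3]    # model B does not

ndcg_a, dcg_a = estimated_ndcg(rel_hat, scores_a, k=2)
ndcg_b, dcg_b = estimated_ndcg(rel_hat, scores_b, k=2)
```

Both models are divided by the same $\widehat{\text{IDCG}}$, so the ratio of their estimated NDCG values equals the ratio of their estimated DCG values regardless of any bias in the normalizer.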

## 5 Optimizing for NDCG via bootstrapping

In this section, we briefly describe some of the most commonly deployed models for survival prediction, and then present our approach for bootstrapping a survival predictor for superior NDCG\.

#### Baseline models

We include in our evaluation a suite of common predictors used for survival analysis. These include the non-parametric Kaplan-Meier (KM) estimator \[Kaplan and Meier, [1958](https://arxiv.org/html/2606.02671#bib.bib100)\], the semi-parametric Cox regression (Cox) model \[Cox, [1972](https://arxiv.org/html/2606.02671#bib.bib23)\], the fully parametric accelerated failure time (AFT) model \[Wei, [1992](https://arxiv.org/html/2606.02671#bib.bib101)\], and the deep neural network models DeepSurv \[Katzman et al., [2018](https://arxiv.org/html/2606.02671#bib.bib102)\] and DeepHit \[Lee et al., [2018](https://arxiv.org/html/2606.02671#bib.bib28)\]. More details of these predictors can be found in [Appendix D](https://arxiv.org/html/2606.02671#A4). These models are typically evaluated on their C-index.

We now show how to leverage a conditional survival predictor $\hat{S}(t \mid X)$ to bootstrap another conditional survival predictor that is optimized specifically for NDCG.

#### Architecture

Our bootstrapping framework operates in two stages. First, a base survival model (e.g., a model from above) is trained on the censored survival data to produce a conditional survival function $\hat{S}(t \mid X)$. Using this baseline, we compute the imputed label given the covariates $X_i$ and a restriction to the horizon $\tau$, $\hat{y}(X_i, \tau) = T_i + \int_{T_i}^{\tau} \frac{\hat{S}(t \mid X_i)}{\hat{S}(T_i \mid X_i)} \, dt$, or if necessary, we default to the restricted mean $\hat{y}(X_i, \tau) = \int_{0}^{\tau} \hat{S}(t \mid X_i) \, dt$. To construct the training labels for the second stage, we create a pseudo-label $y_i^*$ that blends the observed outcomes with the model's conditional expectations to handle censoring. For a patient with observed time $T_i$ and event indicator $\delta_i$, we set $y_i^*$ to $\tau$ if $T_i \geq \tau$, to $T_i$ if $T_i < \tau$ and $\delta_i = 1$, and to $\hat{y}(X_i, \tau)$ if $T_i < \tau$ and $\delta_i = 0$. Finally, we train a model, denoted $f_{\hat{S}}(X_i)$, to predict $y_i^*$ using an objective function designed to optimize the ranking quality and the prediction error. We utilize a gradient-boosted decision tree as the underlying architecture. Other architectures are possible within this framework as well, such as a deep neural network.
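The pseudo-label construction can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the base model is represented by a hypothetical callable `surv_fn` returning $\hat{S}(t \mid X_i)$ on an array of times, integrals use a simple trapezoidal rule, and the fallback to the restricted mean is triggered when $\hat{S}(T_i \mid X_i) = 0$ (the text only says "if necessary", so this trigger is an assumption).

```python
import numpy as np

def integrate(y, x):
    """Trapezoidal rule, written out explicitly."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def imputed_label(surv_fn, t_obs, tau, grid_size=2001):
    """Conditional restricted mean: T + int_T^tau S(t) / S(T) dt."""
    s_at_t = surv_fn(np.array([t_obs]))[0]
    if s_at_t <= 0.0:
        # Assumed fallback: unconditional restricted mean int_0^tau S(t) dt.
        grid = np.linspace(0.0, tau, grid_size)
        return integrate(surv_fn(grid), grid)
    grid = np.linspace(t_obs, tau, grid_size)
    return t_obs + integrate(surv_fn(grid) / s_at_t, grid)

def pseudo_label(surv_fn, t_obs, event, tau):
    """y* = tau if T >= tau; T if the event was observed; imputed otherwise."""
    if t_obs >= tau:
        return float(tau)
    if event == 1:
        return float(t_obs)
    return imputed_label(surv_fn, t_obs, tau)

# Hypothetical patient with exponential survival (rate 0.1 per year),
# censored at 5 years, horizon tau = 20 years.
surv = lambda t: np.exp(-0.1 * t)
label = pseudo_label(surv, 5.0, 0, 20.0)
print(label)
```

For the exponential example the integral has the closed form $5 + 10\,(1 - e^{-1.5})$, which gives a quick check on the numerical integration.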

#### Loss function

We train the second model using a hybrid objective function to balance the identification of the top patients with the accuracy of the expected survival times. The total loss $\mathcal{L}_{\text{hybrid}}$ is a convex combination of the mean squared error (MSE) and a pairwise ranking penalty, motivated by LambdaRank \[Burges, [2010](https://arxiv.org/html/2606.02671#bib.bib42)\]. We combine the two loss functions using a hyperparameter $\alpha \in [0,1]$,

$$\mathcal{L}_{\text{hybrid}} = \alpha \mathcal{L}_{\text{MSE}} + (1-\alpha) \mathcal{L}_{\text{rank}},$$

and evaluate over a resampled mini-batch (i.e., query group) of patients $B$. The regression component, $\mathcal{L}_{\text{MSE}} = \frac{1}{|B|} \sum_{i} (f_{\hat{S}}(X_i) - y_i^*)^2$, provides stability by ensuring the predicted survival time does not deviate too much from the imputed labels. The ranking component, $\mathcal{L}_{\text{rank}}$, optimizes for NDCG through pairwise losses scaled by the change in NDCG. We consider all pairs $(i,j)$ where patient $i$ outlived patient $j$, that is, $y_i^* > y_j^*$. If the predicted difference $f_{\hat{S}}(X_i) - f_{\hat{S}}(X_j)$ is less than a margin $m$, we apply a penalty that is scaled by $|\Delta\text{NDCG}_{i,j}|$, representing the absolute change in the batch's overall NDCG if patients $i$ and $j$ were to swap ranks:

$$\mathcal{L}_{\text{rank}} = \sum_{i,j \in B \mid y_i^* > y_j^*} \frac{1}{2} \max\left\{0,\, m - \left(f_{\hat{S}}(X_i) - f_{\hat{S}}(X_j)\right)\right\}^2 \cdot |\Delta\text{NDCG}_{i,j}|.$$

Scaling by the batch-wide $|\Delta\text{NDCG}_{i,j}|$ ensures the model prioritizes a correct ordering of the top-ranked candidates, effectively maximizing the utility of the resulting allocation policy. We leverage the batch-wide metric rather than $\Delta\text{NDCG}@k_{i,j}$ to provide a smoother gradient signal.
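As an illustration, the hybrid loss can be evaluated for one batch as in the sketch below. It is written under assumptions not stated in this section: the gain is taken to be the pseudo-label itself, the discount is $1/\log_2(1+\text{rank})$, ranks are induced by the current predictions, and the labels are positive so that the ideal DCG is nonzero. The function name `hybrid_loss` and its defaults are hypothetical, and the sketch only evaluates the loss; it does not train a gradient-boosted model.

```python
import numpy as np

def hybrid_loss(pred, y_star, alpha=0.5, margin=1.0):
    """alpha * MSE + (1 - alpha) * pairwise hinge^2 penalty weighted by |delta NDCG|."""
    pred = np.asarray(pred, dtype=float)
    y_star = np.asarray(y_star, dtype=float)
    n = pred.size
    disc = 1.0 / np.log2(1.0 + np.arange(1, n + 1))      # assumed log discount
    idcg = float(np.sum(np.sort(y_star)[::-1] * disc))   # ideal DCG of the batch

    # Rank position of each patient under the current predictions (0 = top).
    order = np.argsort(-pred, kind="stable")
    pos = np.empty(n, dtype=int)
    pos[order] = np.arange(n)

    mse = float(np.mean((pred - y_star) ** 2))

    rank_loss = 0.0
    for i in range(n):
        for j in range(n):
            if y_star[i] > y_star[j]:                     # patient i outlived patient j
                hinge = max(0.0, margin - (pred[i] - pred[j]))
                # |change in batch NDCG| if i and j swapped rank positions
                delta = abs((y_star[i] - y_star[j]) * (disc[pos[i]] - disc[pos[j]])) / idcg
                rank_loss += 0.5 * hinge ** 2 * delta
    return alpha * mse + (1.0 - alpha) * rank_loss

y4 = [4.0, 3.0, 2.0, 1.0]
top_swap = hybrid_loss([3.0, 4.0, 2.0, 1.0], y4, alpha=0.0, margin=0.5)
bottom_swap = hybrid_loss([4.0, 3.0, 1.0, 2.0], y4, alpha=0.0, margin=0.5)
print(top_swap, bottom_swap)
```

In the usage lines, the same prediction error is placed once on the top two patients and once on the bottom two. With the ranking term alone ($\alpha = 0$), the mistake at the top receives the larger penalty because the discount changes more between the first two positions than between the last two.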

## 6 Experiments

We utilize the United Network for Organ Sharing (UNOS) patient registry containing clinical data for adult heart transplants in the US dating back to 1987. We provide a summary of the dataset's characteristics in [Table A4](https://arxiv.org/html/2606.02671#A7.T4). Our learning objective is to estimate graft survival, defined as the time elapsed from transplant to organ failure or recipient death. To ensure stability, we restrict to predicting up to $\tau = 20$ years, aligning with the 95th percentile of censoring times in the data ([Table A5](https://arxiv.org/html/2606.02671#A7.T5)).

An alternative goal to graft survival prediction would be to predict the life-years gained (LYG) of a transplant, which is the difference in years of life for a patient with and without a transplant. While LYG is often used for allocation, it relies on the graft survival prediction. For heart allocation, graft survival is also the driving factor of LYG: conditioned on being waitlisted, a patient's survival without a transplant is typically short, so the post-transplant outcome dominates LYG \[Colvin et al., [2025](https://arxiv.org/html/2606.02671#bib.bib112)\].

### 6.1 Warm-up: Artificial censoring

We begin by evaluating our bootstrapping framework and our NDCG estimators under artificial censoring. We restrict the initial cohort to only patients with observed events, obtaining a ground-truth dataset where the exact survival time $T_i^*$ is known for every individual. We then introduce artificial censoring to simulate the information loss present in real-world clinical registries. We provide further details on the experimental setup in [Appendix F](https://arxiv.org/html/2606.02671#A6).

#### Results of bootstrapping

[Table 1](https://arxiv.org/html/2606.02671#S6.T1) presents the comparative performance between baseline survival predictors and their bootstrapped counterparts, and we show the percent gain in NDCG in [Figure 2](https://arxiv.org/html/2606.02671#S6.F2). We report ground-truth metrics calculated using the hidden true event times. Across all baseline models, our bootstrapping framework consistently yields substantial improvements in $\text{NDCG}@k$. The standard survival models achieve poor $\text{NDCG}@k$, with values consistently below 0.3 for $k=1$ regardless of the model. In contrast, our approach improves this to over 0.4 and as high as 0.5 for $\text{NDCG}@1$. The increase for $\text{NDCG}@1$ is substantial: the largest improvement from bootstrapping exceeds 0.2, and our method consistently yields at least a 50% increase.

The gain in NDCG is achieved without compromising other metrics; our framework maintains nearly identical \(in fact, marginally superior\) C\-index and AUC scores compared to the standard models\. We observe that the full NDCG is around 0\.9 across all models\. This is largely due to the density of survival times in the cohort, where the cumulative sum of relevance is dominated by the mass of long\-term survivors, making the metric less sensitive to individual swaps\.
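The insensitivity of the full NDCG can be illustrated with a small synthetic sketch. The assumptions here are ours, not the paper's setup: survival times are drawn uniformly on $[0, 20]$ years, the discount is $1/\log_2(1+\text{rank})$, $\text{NDCG}@k$ is taken as $\text{DCG}@k / \text{IDCG}@k$, and the "model" is a completely uninformed random ranking.

```python
import numpy as np

def ndcg_at_k(rel_in_rank_order, k):
    """NDCG@k = DCG@k / IDCG@k with an assumed log2 discount."""
    rel = np.asarray(rel_in_rank_order, dtype=float)
    disc = 1.0 / np.log2(1.0 + np.arange(1, k + 1))
    dcg = np.sum(rel[:k] * disc)
    idcg = np.sum(np.sort(rel)[::-1][:k] * disc)
    return float(dcg / idcg)

rng = np.random.default_rng(0)
n = 1000
survival = rng.uniform(0.0, 20.0, size=n)   # synthetic survival times (years)

top1, full = [], []
for _ in range(200):
    shuffled = rng.permutation(survival)    # uninformed random ranking
    top1.append(ndcg_at_k(shuffled, 1))
    full.append(ndcg_at_k(shuffled, n))

mean_top1, mean_full = float(np.mean(top1)), float(np.mean(full))
print(f"random ranking: NDCG@1 = {mean_top1:.3f}, full NDCG = {mean_full:.3f}")
```

Even a random ranking scores a high full NDCG in this setting, because the sum is dominated by the many patients with sizable survival times, while $\text{NDCG}@1$ remains far lower. This is consistent with treating small $k$ as the more informative regime for allocation.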

Table 1: Performance of models on ground-truth metrics under artificial censoring.

![Refer to caption](https://arxiv.org/html/2606.02671v1/x1.png)

Figure 2: Average % gain in $\text{NDCG}@k$ of bootstrapped model over baseline predictor.

#### Accuracy of NDCG estimators

We evaluate the fidelity of the EY and IPCW estimators for NDCG by comparing their outputs against the ground-truth NDCG available from our artificial censoring. For the EY estimator, we utilize all baseline survival predictors as nuisance models. For the IPCW estimator, we employ the KM and Cox nuisance models. [Figure A1](https://arxiv.org/html/2606.02671#A7.F1) displays the mean absolute error (MAE) for each estimator across different values of $k$. We observe that the estimation error generally decreases as $k$ increases, approaching near-zero error for the full NDCG estimation. This trend is expected, as larger values of $k$ aggregate more points, which averages out individual estimation errors. For $\text{NDCG}@1$, the MAE exceeds 0.1. The IPCW estimator has the lowest average error as $k$ increases. In general, we expect some error, particularly due to the bias detailed in [Section 4.2](https://arxiv.org/html/2606.02671#S4.SS2).

Despite the moderate absolute error, the estimators demonstrate strong correlation when compared to the ground-truth NDCG ([Figure A2](https://arxiv.org/html/2606.02671#A7.F2)). The EY estimator, aggregated across all nuisance models, exhibits a high degree of correlation with the true NDCG, achieving a Spearman rank correlation coefficient of 0.8 for $k \in \{1, 5, 10, 50\}$. In contrast, the IPCW estimator displays higher variance; while individual estimates are noisier, the mean of the distribution tracks the ground truth effectively. We observe some positive bias in the EY estimator, likely due to it over-estimating population survival.

Finally, we evaluate the *relative* accuracy of our estimators, specifically their reliability in performing model selection. We report the pairwise concordance: for all pairs of survival models $(A, B)$, we determine the frequency with which the estimator correctly identifies the superior model, that is, whether $\widehat{\text{NDCG}}_A > \widehat{\text{NDCG}}_B$ given that $\text{NDCG}_A > \text{NDCG}_B$. As shown in [Table A6](https://arxiv.org/html/2606.02671#A7.T6), our estimators are consistent at predicting the relative ranking of models. We also evaluate an ensemble estimation, which averages the estimates over all nuisance models. The EY estimator demonstrates superior reliability, correctly identifying the better-performing model in over 80% of cases, and reaching 90% accuracy in many instances. While the IPCW estimator is more volatile, it still achieves over 75% pairwise accuracy for $k=1$. These results establish that both estimators, particularly the EY approach, are robust tools for model selection in the presence of censoring.
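The pairwise concordance described above can be computed as in the following minimal sketch. The function name, the dictionary layout, and the numeric values are hypothetical and chosen only for illustration. Pairs whose true NDCG values tie are skipped, since neither model is superior.

```python
from itertools import combinations

def pairwise_concordance(estimated, truth):
    """Fraction of model pairs whose estimated NDCG ordering matches the true ordering."""
    agree, total = 0, 0
    for a, b in combinations(sorted(truth), 2):
        true_diff = truth[a] - truth[b]
        if true_diff == 0:
            continue                      # no superior model in this pair
        total += 1
        est_diff = estimated[a] - estimated[b]
        if est_diff * true_diff > 0:      # same sign: estimator picks the right model
            agree += 1
    return agree / total if total else float("nan")

# Made-up values for three hypothetical models.
true_ndcg = {"A": 0.50, "B": 0.40, "C": 0.30}
est_ndcg = {"A": 0.45, "B": 0.47, "C": 0.20}
concordance = pairwise_concordance(est_ndcg, true_ndcg)
print(concordance)
```

In this toy example the estimator misorders only the pair $(A, B)$, so two of the three pairs are concordant.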

### 6.2 Evaluating on the full dataset

Having validated our estimators and our bootstrapping approach under artificial censoring, we now apply our methods to the complete UNOS registry. Unlike the artificial setup, the survival times of censored patients in this dataset are truly unknown. We rely on the NDCG estimators to assess the ranking performance. We summarize the results in [Table 2](https://arxiv.org/html/2606.02671#S6.T2) and visually show the percent gain in the NDCG estimates in [Figure 3](https://arxiv.org/html/2606.02671#S6.F3). The results show a consistent performance gain across all baseline predictors when integrated into our bootstrapping framework. The C-index and AUC at 5 years (AUC@5) are again very slightly improved for every baseline model tested when using bootstrapping. The AUC@5 measures the model's ability to distinguish between graft failure and survival at five years post-transplant. These results suggest that the architecture we use for the bootstrapped model may be an effective model for this predictive task.

When evaluating the estimated $\text{NDCG}@k$, we see the true power of bootstrapping. Across both EY and IPCW estimators, our framework outperforms standard survival models, and by substantial margins at small values of $k$, regardless of the underlying imputer. While the $\text{NDCG}@1$ is comparable to the $\text{NDCG}@5$ for our approach, the variance of the estimation decreases as $k$ increases. This is one reason why we may examine a larger $k$ when selecting a predictor. The IPCW estimates exhibit higher variance, but they largely trend in the same direction as the EY estimates, confirming that our approach is effective at improving the identification of high-utility candidates.

Table 2: Performance of models on metrics and estimated NDCG (avg over all nuisance models).

![Refer to caption](https://arxiv.org/html/2606.02671v1/x2.png)

(a) Average % gain of EY estimation.

![Refer to caption](https://arxiv.org/html/2606.02671v1/x3.png)

(b) Average % gain of IPCW estimation.

Figure 3: Average % gain in $\text{NDCG}@k$ estimations of bootstrapped model over baseline predictor.

## 7 Limitations

Relying on real-world historical data introduces inherent limitations. The UNOS registry, like many medical databases, contains erroneous entries and missing data. We preprocess the dataset to impute missing entries and filter inconsistent values. However, there may be unobserved confounders absent from the registry which influence outcomes. Systemic biases present in historical clinical decisions can also inadvertently carry over to the predictors. Despite these limitations, our bootstrapping approach is complementary to future advancements in the underlying survival predictors.

The unbiasedness of our DCG estimators relies on strong assumptions regarding the nuisance models. While theoretically necessary, the assumptions are not always achievable in practice. If the underlying survival model suffers from miscalibration, the EY estimator will inherit this bias. The IPCW estimator similarly requires that the censoring distribution is correctly specified. If censoring is highly informative or the probability of being uncensored approaches zero, the IPCW weights lead to very high-variance estimates. This motivates the use of restricted horizons. However, restricted horizons come at the cost of expressivity. For our application, this restriction is not a major limitation (since the median graft survival is well under 20 years), but it could be a barrier to other applications. Finally, even with unbiased DCG estimators, the resulting NDCG estimate is not unbiased. Our framework reliably determines relative performance and aligns the learning with the allocation, but cannot determine the exact ground-truth NDCG. These limitations underscore the difficulty of predictive tasks under right censorship and motivate future work in bias mitigation and model calibration.

## 8 Conclusions

We addressed the misalignment between predictive components for survival analysis and the requirements of decision making. We demonstrated that predictors with standard aggregate metrics cannot guarantee any utility when used for allocation. To bridge this gap, we established $\text{NDCG}@1$ as a theoretically grounded measure that directly translates to allocation performance.

In addition, we developed novel estimators of NDCG for right censored datasets and proposed a bootstrapping approach to optimize survival models for NDCG\. Our empirical results on real heart transplant data showed the effectiveness of our estimators at determining model superiority and the substantial increase in NDCG when using bootstrapping\. By aligning the objectives of survival modeling with allocation optimization, our framework provides a scalable template for improving outcomes in organ transplantation and other domains of decision\-making under uncertainty\.

## Acknowledgments

Tuomas Sandholm and his PhD students Ioannis Anagnostides and Itai Zilberstein are supported by NIH award A240108S001, the Vannevar Bush Faculty Fellowship ONR N00014\-23\-1\-2876, and National Science Foundation grant RI\-2312342\. Itai Zilberstein is also supported by the NSF Graduate Research Fellowship Program under grant DGE2140739\. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author\(s\) and do not necessarily reflect the views of the funding agencies\.

## References

- D. J. Abraham, A. Blum, and T. Sandholm (2007) Clearing algorithms for barter exchange markets: enabling nationwide kidney exchanges. In ACM Conference on Electronic Commerce (EC).
- N. Aleksova, A. C. Alba, V. M. Molinero, K. Connolly, A. Orchanian-Cheff, M. Badiwala, H. J. Ross, and J. G. D. Posada (2020) Risk prediction models for survival after heart transplantation: a systematic review. American Journal of Transplantation 20(4), pp. 1137–1151.
- E. Allen, R. Taylor, A. Gimson, and D. Thorburn (2024) Transplant benefit-based offering of deceased donor livers in the United Kingdom. Journal of Hepatology 81(3), pp. 471–478.
- I. Anagnostides, Z. Sollie, A. Kilic, and T. Sandholm (2025) Policy optimization for dynamic heart transplant allocation. Circulation 152(Suppl_3), pp. A4369427–A4369427.
- I. Anagnostides, I. Zilberstein, Z. W. Sollie, A. Kilic, and T. Sandholm (2026) Position: machine learning for heart transplant allocation policy optimization should account for incentives. In International Conference on Machine Learning (ICML).
- H. Arno, D. Frauen, E. Javurek, T. Demeester, and S. Feuerriegel (2026) Rank-learner: orthogonal ranking of treatment effects. In International Conference on Machine Learning (ICML).
- P. Awasthi and T. Sandholm (2009) Online stochastic optimization in the large: application to kidney exchange. In International Joint Conference on Artificial Intelligence (IJCAI).
- B. Ayers, T. Sandholm, I. Gosev, S. Prasad, and A. Kilic (2021) Using machine learning to improve survival prediction after heart transplantation. Journal of Cardiac Surgery 36(11), pp. 4113–4120.
- J. Berrevoets, A. M. Alaa, Z. Qian, J. Jordon, A. E. S. Gimson, and M. van der Schaar (2021) Learning queueing policies for organ transplantation allocation using interpretable counterfactual survival analysis. In International Conference on Machine Learning (ICML).
- J. Berrevoets, J. Jordon, I. Bica, and M. van der Schaar (2020) OrganITE: optimal transplant donor organ offering using an individual treatment effect. In Neural Information Processing Systems (NeurIPS).
- N. E. Breslow (1975) Analysis of survival data under the proportional hazards model. International Statistical Review, pp. 45–57.
- C. J. Burges (2010) From RankNet to LambdaRank to LambdaMART: An overview. Learning 11(23-581), pp. 81.
- R. Burke, A. Felfernig, and M. H. Göker (2011) Recommender systems: an overview. AI Magazine 32(3), pp. 13–18.
- M. Cameli, M. C. Pastore, A. Campora, M. Lisi, and G. E. Mandoli (2022) Donor shortage in heart transplantation: how can we overcome this challenge? Frontiers in Cardiovascular Medicine 9, pp. 1001002.
- Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In International Conference on Machine Learning (ICML).
- A. Capitaine, M. Haddouche, E. Moulines, M. I. Jordan, E. Boursier, and A. Durmus (2025) Online decision-focused learning. arXiv:2505.13564.
- M. M. Colvin, J. M. Smith, Y. S. Ahn, K. A. Lindblad, D. Handarova, A. K. Israni, and J. J. Snyder (2025) OPTN/SRTR 2023 annual data report: heart. American Journal of Transplantation 25(2), pp. S329–S421.
- D. R. Cox (1972) Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34(2), pp. 187–202.
- A. P. Dawid (1982) The well-calibrated Bayesian. Journal of the American Statistical Association 77(379), pp. 605–610.
- J. P. Dickerson, D. F. Manlove, B. Plaut, T. Sandholm, and J. Trimble (2016) Position-indexed formulations for kidney exchange. In ACM Conference on Economics and Computation (EC).
- J. P. Dickerson and T. Sandholm (2015) FutureMatch: combining human value judgments and machine learning to match in dynamic environments. In Conference on Artificial Intelligence (AAAI).
- P. L. Donti, J. Z. Kolter, and B. Amos (2017) Task-based end-to-end model learning in stochastic optimization. In Neural Information Processing Systems (NeurIPS).
- A. N. Elmachtoub and P. Grigas (2022) Smart "predict, then optimize". Management Science 68(1), pp. 9–26.
- C. Fernández-Loría and F. Provost (2022) Causal decision making and causal effect estimation are not the same… and why it matters. INFORMS Journal on Data Science 1(1), pp. 4–16.
- D. P. Foster and R. V. Vohra (1998) Asymptotic calibration. Biometrika 85(2), pp. 379–390.
- D. Frauen, V. Melnychuk, J. Schweisthal, M. van der Schaar, and S. Feuerriegel (2025) Treatment effect estimation for optimal decision-making. arXiv preprint arXiv:2505.13092.
- T. A. Gerds and M. Schumacher (2006) Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biometrical Journal 48(6), pp. 1029–1040.
- M. Gönen and G. Heller (2005) Concordance probability and discriminatory power in proportional hazards regression. Biometrika 92(4), pp. 965–970.
- J. Gottlieb, J. Smits, R. Schramm, F. Langer, R. Buhl, C. Witt, M. Strueber, and H. Reichenspurner (2017) Lung transplantation in Germany since the introduction of the lung allocation score: a retrospective analysis. Deutsches Arzteblatt International 114(11), pp. 179.
- E. Graf, C. Schmoor, W. Sauerbrei, and M. Schumacher (1999) Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine 18(17-18), pp. 2529–2545.
- F. E. Harrell, R. M. Califf, D. B. Pryor, K. L. Lee, and R. A. Rosati (1982) Evaluating the yield of medical tests. Journal of the American Medical Association (JAMA) 247(18), pp. 2543–2546.
- K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20(4), pp. 422–446.
- P. S. Kamath, R. H. Wiesner, M. Malinchoc, W. Kremers, T. M. Therneau, C. L. Kosberg, G. D’Amico, E. R. Dickson, and W. R. Kim (2001) A model to predict survival in patients with end-stage liver disease. Hepatology 33(2), pp. 464–470.
- F. Kamran, M. Makar, and J. Wiens (2024) Learning to rank for optimal treatment allocation under resource constraints. In International Conference on Artificial Intelligence and Statistics (AISTATS).
- E. L. Kaplan and P. Meier (1958) Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53(282), pp. 457–481.
- J. L. Katzman, U. Shaham, A. Cloninger, J. Bates, T. Jiang, and Y. Kluger (2018) DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology 18(1), pp. 24.
- C. Lee, W. Zame, J. Yoon, and M. van der Schaar (2018) DeepHit: a deep learning approach to survival analysis with competing risks. In Conference on Artificial Intelligence (AAAI).
- J. Mandi, J. Kotary, S. Berden, M. Mulamba, V. Bucarey, T. Guns, and F. Fioretto (2024) Decision-focused learning: foundations, state of the art, benchmark and future opportunities. Journal of Artificial Intelligence Research (JAIR) 80, pp. 1623–1701.
- G. Mayer and G. G. Persijn (2006) Eurotransplant kidney allocation system (ETKAS): rationale and implementation. Nephrology Dialysis Transplantation 21(1), pp. 2–3.
- M. Mitzenmacher and S. Vassilvitskii (2022) Algorithms with predictions. Communications of the ACM 65(7), pp. 33–35.
- C. Nagpal, S. Yadlowsky, N. Rostamzadeh, and K. Heller (2021) Deep Cox mixtures for survival regression. In Machine Learning for Healthcare Conference, pp. 674–708.
- OPTN (2025) Continuous distribution. Organ Procurement and Transplantation Network (OPTN). Note: [https://www.hrsa.gov/optn/policies-bylaws/policy-issues/continuous-distribution](https://www.hrsa.gov/optn/policies-bylaws/policy-issues/continuous-distribution)
- T. Papalexopoulos, J. Alcorn, D. Bertsimas, R. Goff, D. Stewart, and N. Trichakis (2024) Reshaping national organ allocation policy. Operations Research 72(4), pp. 1475–1486.
- J. Perdomo, T. Zrnic, C. Mendler-Dünner, and M. Hardt (2020) Performative prediction. In International Conference on Machine Learning (ICML).
- M. Richardson, E. Dominowska, and R. Ragno (2007) Predicting clicks: estimating the click-through rate for new ads. In International Conference on World Wide Web (WWW).
- M. Rossetti, F. Stella, and M. Zanker (2016) Contrasting offline and online results when evaluating recommendation algorithms. In ACM Conference on Recommender Systems, pp. 31–34.
- H. Schütze, C. D. Manning, and P. Raghavan (2008) Introduction to information retrieval. Vol. 39, Cambridge University Press, Cambridge.
- J. H. Shen, E. Vitercik, and A. Wikum (2025) Algorithms with calibrated machine learning predictions. In International Conference on Machine Learning (ICML).
- P. Shojaee, X. Chen, and R. Jin (2021) Adaptively weighted top-N recommendation for organ matching. ACM Transactions on Computing for Healthcare 3(1), pp. 1–29.
- S. Shore, J. R. Golbus, K. D. Aaronson, and B. K. Nallamothu (2020) Changes in the United States adult heart allocation policy: challenges and opportunities. Circulation: Cardiovascular Quality and Outcomes 13(10), pp. e005795.
- X. Su and S. Zenios (2004) Patient choice in kidney allocation: the role of the queueing discipline. Manufacturing & Service Operations Management 6(4), pp. 280–301.
- Y. Tao and H. Xu (2026) Necessary optimality conditions for integrated learning and optimization problem in contextual optimization. arXiv:2601.16581.
- H. Uno, T. Cai, M. J. Pencina, R. B. D’Agostino, and L. Wei (2011) On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine 30(10), pp. 1105–1117.
- T. Vanderschueren, W. Verbeke, F. Moraes, and H. M. Proença (2024) Metalearners for ranking treatment effects. arXiv preprint arXiv:2405.02183.
- C. Wang (2023) Calibration in deep learning: a survey of the state-of-the-art. arXiv:2308.01222.
- Y. Wang, L. Wang, Y. Li, D. He, and T. Liu (2013) A theoretical analysis of NDCG type ranking measures. In Conference on Learning Theory (COLT).
- L. Wei (1992) The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Statistics in Medicine 11(14-15), pp. 1871–1879.
- B. Wilder, B. Dilkina, and M. Tambe (2019) Melding the data-decisions pipeline: decision-focused learning for combinatorial optimization. In Conference on Artificial Intelligence (AAAI).
- K. C. Zhang, N. Narang, C. Jasseron, R. Dorent, K. A. Lazenby, M. N. Belkin, J. Grinstein, A. Mayampurath, M. M. Churpek, K. K. Khush, and W. F. Parker (2024) Development and validation of a risk score predicting death without transplant in adult heart transplant candidates. Journal of the American Medical Association (JAMA) 331(6), pp. 500–509.
- I. Zilberstein, I. Anagnostides, Z. W. Sollie, A. Kilic, and T. Sandholm (2026a) Learning potentials for dynamic matching and application to heart transplantation. arXiv preprint arXiv:2602.08878.
- I. Zilberstein, I. Anagnostides, Z. W. Sollie, A. Kilic, and T. Sandholm (2026b) Near-optimal dynamic matching via coarsening with application to heart transplantation. In International Conference on Machine Learning (ICML).

## Appendix A Related work

#### Decision\-focused learning and end\-to\-end learning

A closely related line of work is *decision\-focused learning \(DFL\)*\. As in our paper, the DFL framework is motivated by the fact that, in machine learning, optimization is typically based on estimators, yet these two pieces are often treated in isolation by typical approaches \[Wilder et al\., [2019](https://arxiv.org/html/2606.02671#bib.bib65)\]\. Specifically, a predictive model is first trained through some measure of accuracy, for example, mean squared error\. Then that model’s predictions are given as input to an optimization algorithm in order to make a decision\. While this two\-stage approach is justified when the predictive model is highly accurate, in complex tasks state\-of\-the\-art models will inevitably be imperfect\. The training process involves tradeoffs as to the nature of such errors, so treating prediction and optimization in isolation creates a critical misalignment\.

To address this misalignment, Donti et al\. \[[2017](https://arxiv.org/html/2606.02671#bib.bib57)\] proposed an end\-to\-end approach for learning ML models in a way that directly captures the final task\-based objectives for which the models will be used; this approach is referred to as *end\-to\-end model learning*\. Similarly, decision\-focused learning endeavors to align decision and learning \[Wilder et al\., [2019](https://arxiv.org/html/2606.02671#bib.bib65), Elmachtoub and Grigas, [2022](https://arxiv.org/html/2606.02671#bib.bib64), Mandi et al\., [2024](https://arxiv.org/html/2606.02671#bib.bib63), Capitaine et al\., [2025](https://arxiv.org/html/2606.02671#bib.bib62)\]\. This was pioneered by Wilder et al\. \[[2019](https://arxiv.org/html/2606.02671#bib.bib65)\], who studied certain classes of combinatorial optimization problems, and is also known as integrated decision learning \[Tao and Xu, [2026](https://arxiv.org/html/2606.02671#bib.bib61)\]\. Closer to our work is the recent paper of Capitaine et al\. \[[2025](https://arxiv.org/html/2606.02671#bib.bib62)\], which examines DFL in dynamic environments where the objective and data distribution evolve over time\. For a survey on this rapidly growing line of work, we refer to Mandi et al\. \[[2024](https://arxiv.org/html/2606.02671#bib.bib63)\]\.

There are two key challenges in applying those prior frameworks in our setting: i\) the presence of right censorship, which significantly complicates statistical estimation, and ii\) the objective we want to optimize, namely $\text{NDCG}@1$, is highly sensitive to distribution shifts\.

#### Algorithms with predictions

There is a flourishing line of work on algorithm design through the use of *predictions* \[Mitzenmacher and Vassilvitskii, [2022](https://arxiv.org/html/2606.02671#bib.bib14)\], where the goal is to improve performance under reliable estimators and revert to worst\-case performance when the predictors are inaccurate\. Heart transplant allocation can also be viewed from that perspective\. As motivated above, the interplay between prediction quality and algorithm design is at the core of our approach, driven by the observation that the estimation part needs to be informed by the policy optimization component\. These two are typically treated in isolation in prior work on algorithms with predictions\.

#### Calibration

Prediction that aligns with downstream decision tasks can also be accomplished through *calibration* \[Dawid, [1982](https://arxiv.org/html/2606.02671#bib.bib58), Wang, [2023](https://arxiv.org/html/2606.02671#bib.bib60), Foster and Vohra, [1998](https://arxiv.org/html/2606.02671#bib.bib10)\], as it effectively allows treating the estimated quantities as probabilities; exploring how calibration can be used in cases where there is right censorship is an interesting direction for future work\. To connect with the line of work on algorithms with predictions, recent work by Shen et al\. \[[2025](https://arxiv.org/html/2606.02671#bib.bib38)\] studied the ski rental problem under calibrated predictors\.

Moreover, the interaction between estimation and downstream decisions is also central in the framework of performative prediction \[Perdomo et al\., [2020](https://arxiv.org/html/2606.02671#bib.bib59)\], where the underlying distribution in the estimation part shifts due to the strategic decisions of the population\.

#### Recommender systems

Another closely related line of research is that of recommender systems \[Burke et al\., [2011](https://arxiv.org/html/2606.02671#bib.bib114), Wang et al\., [2013](https://arxiv.org/html/2606.02671#bib.bib105), Rossetti et al\., [2016](https://arxiv.org/html/2606.02671#bib.bib113), Järvelin and Kekäläinen, [2002](https://arxiv.org/html/2606.02671#bib.bib106)\], which focuses on identifying the optimal items to present to a user\. Learning\-to\-rank \[Cao et al\., [2007](https://arxiv.org/html/2606.02671#bib.bib4), Burges, [2010](https://arxiv.org/html/2606.02671#bib.bib42)\] is a foundational technique for training a recommender system, and is the underlying learning objective of our bootstrapping approach\. While the connection between learning\-to\-rank and allocation has been explored in domains such as healthcare resource management \[Kamran et al\., [2024](https://arxiv.org/html/2606.02671#bib.bib43)\] and ad auctions \[Richardson et al\., [2007](https://arxiv.org/html/2606.02671#bib.bib48)\], the formal link between ranking metrics and downstream allocation utility has not been previously established\.

#### Survival prediction and organ allocation

Survival analysis, a specialized field of time\-to\-event prediction, focuses on modeling the distribution of event times under conditions of right censorship \[Kaplan and Meier, [1958](https://arxiv.org/html/2606.02671#bib.bib100), Wei, [1992](https://arxiv.org/html/2606.02671#bib.bib101)\]\. Unlike standard regression tasks, the presence of censoring means that for many subjects, we only possess a lower bound on the true event time, rather than an exact observation\. Cox regression \[Cox, [1972](https://arxiv.org/html/2606.02671#bib.bib23)\] is one of the foundational statistical models for survival analysis\. Cox models, which are often evaluated via concordance, are commonly used in organ allocation \[Papalexopoulos et al\., [2024](https://arxiv.org/html/2606.02671#bib.bib54), OPTN, [2025](https://arxiv.org/html/2606.02671#bib.bib39), Dickerson and Sandholm, [2015](https://arxiv.org/html/2606.02671#bib.bib6)\]\. More recently, deep learning solutions have been developed to extend beyond semi\-parametric models for survival analysis \[Lee et al\., [2018](https://arxiv.org/html/2606.02671#bib.bib28), Nagpal et al\., [2021](https://arxiv.org/html/2606.02671#bib.bib104), Katzman et al\., [2018](https://arxiv.org/html/2606.02671#bib.bib102)\]\.

Matching donor organs to patients for transplantation is a high\-stakes application of machine learning and matching algorithms \[Su and Zenios, [2004](https://arxiv.org/html/2606.02671#bib.bib56), Abraham et al\., [2007](https://arxiv.org/html/2606.02671#bib.bib91), Awasthi and Sandholm, [2009](https://arxiv.org/html/2606.02671#bib.bib73), Dickerson and Sandholm, [2015](https://arxiv.org/html/2606.02671#bib.bib6), Dickerson et al\., [2016](https://arxiv.org/html/2606.02671#bib.bib92), Berrevoets et al\., [2020](https://arxiv.org/html/2606.02671#bib.bib1), [2021](https://arxiv.org/html/2606.02671#bib.bib15), Anagnostides et al\., [2026](https://arxiv.org/html/2606.02671#bib.bib98), Shojaee et al\., [2021](https://arxiv.org/html/2606.02671#bib.bib115)\]\. Most data\-driven approaches rely on predictions of transplant outcomes \[Berrevoets et al\., [2020](https://arxiv.org/html/2606.02671#bib.bib1), [2021](https://arxiv.org/html/2606.02671#bib.bib15), Dickerson and Sandholm, [2015](https://arxiv.org/html/2606.02671#bib.bib6), Papalexopoulos et al\., [2024](https://arxiv.org/html/2606.02671#bib.bib54), Shojaee et al\., [2021](https://arxiv.org/html/2606.02671#bib.bib115), Zilberstein et al\., [2026a](https://arxiv.org/html/2606.02671#bib.bib103)\]\. These existing methods treat the predictive model as a black box optimized for aggregate statistical accuracy\. While some work has focused on making the predictions more robust \[Berrevoets et al\., [2021](https://arxiv.org/html/2606.02671#bib.bib15)\], the predictive objectives still remain decoupled from the combinatorial requirements of the matching algorithm\. Because these predictors are trained in isolation from the downstream matching objective, the resulting allocation mechanisms lack provable bounds on solution quality relative to the true underlying utility\.

#### Conditional average treatment effect estimation

One of the most similar lines of work to ours concerns estimating the conditional average treatment effect \(CATE\), a causal inference problem\. Learning\-to\-rank has also been identified as a core task for CATE when allocating treatments \[Kamran et al\., [2024](https://arxiv.org/html/2606.02671#bib.bib43), Fernández\-Loría and Provost, [2022](https://arxiv.org/html/2606.02671#bib.bib118)\]\. Similar to our bootstrapping approach, re\-training a ranker based on an underlying predictive model is a common technique \[Vanderschueren et al\., [2024](https://arxiv.org/html/2606.02671#bib.bib119), Frauen et al\., [2025](https://arxiv.org/html/2606.02671#bib.bib117), Arno et al\., [2026](https://arxiv.org/html/2606.02671#bib.bib116)\]\. Our work differentiates itself from this line of work in multiple ways\. First, survival analysis revolves around a different predictive task from the causal treatment effects, which is more akin to predicting the life\-years gained\. Prior work on CATE has also not addressed right censorship, a distinguishing challenge of survival analysis\. Finally, in general, these prior works focus on ranking an entire set of patients, whereas our allocation is most concerned with predicting the top\-$k$ candidates correctly for very small values of $k$ \(i\.e\., $k=1$\)\.

## Appendix B Omitted algorithms and proofs

We begin by proving our claims regarding allocation using a predictor to determine edge weights with only a guarantee on the concordance index of the predictor\.

See [Proposition 1](https://arxiv.org/html/2606.02671#Thmproposition1)\.

###### Proof\.

Let $N \geq 2$ and consider an arbitrary ranking of the $N$ candidates determined by $\hat{f}$\. Let $\pi(i) \in \{1, \dots, N\}$ denote the predicted rank of candidate $i$, where a lower $\pi(i)$ corresponds to a higher utility\.

Given this predicted ranking, the deterministic policy will select a specific candidate\. We will adversarially construct the true utilities based on the policy’s selection\.

For a small constant $\epsilon > 0$, we assign the true utilities, $T_i$, as follows\.

- For the selected candidate $i$, $T_i = \rho - \epsilon \cdot \pi(i)$\.
- For the $N-1$ non\-selected candidates $j$, $T_j = 1 + \epsilon \cdot (N - \pi(j))$\.

The true utility of the selected candidate is less than the utility of any candidate not selected\. The policy achieves a total utility of

$$U(\mathsf{ALG}) = \rho - \epsilon \cdot \pi(i) < \rho.$$

Because $N \geq 2$, the optimal policy will select a candidate not selected by the algorithm, achieving a utility of

$$U(\mathrm{OPT}) = 1 + \epsilon \cdot (N - \pi(j)) > 1.$$

The fraction of the optimal utility achieved by the policy is therefore upper bounded by $\rho$\.

We now calculate the C\-index of the predictor $\hat{f}$\. The total number of comparable pairs is $\binom{N}{2}$\. All $\binom{N-1}{2}$ pairs excluding candidate $i$ are perfectly ranked\. The incorrectly ranked pairs occur when comparing the selected candidate to the rest\. There are at most $(N-1)$ such pairs\. The C\-index is bounded below by

$$1 - \frac{N-1}{\binom{N}{2}} = 1 - \frac{2}{N}.$$

As $N \to \infty$, the C\-index approaches $1$\. We can choose a sufficiently large $N > 2/(1-c)$\. For such an $N$, the predictor achieves C\-index at least $c$ while the allocation policy is restricted to a $\rho$ fraction of the optimal utility\. ∎
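The construction in this proof can be checked numerically. The following is a minimal sketch, assuming a hypothetical deterministic policy that always selects the top\-ranked candidate; the function names and the values of `n`, `rho`, and `eps` are illustrative and not taken from the paper.

```python
from itertools import combinations


def c_index(pred_rank, utility):
    """Fraction of comparable pairs whose predicted order matches the true order.

    A lower predicted rank means a higher predicted utility.
    """
    concordant = 0
    comparable = 0
    for a, b in combinations(range(len(utility)), 2):
        if utility[a] == utility[b]:
            continue  # tied pairs are not comparable
        comparable += 1
        if (pred_rank[a] < pred_rank[b]) == (utility[a] > utility[b]):
            concordant += 1
    return concordant / comparable


def adversarial_utilities(pred_rank, selected, rho, eps):
    """True utilities from the proof: the selected candidate is made the worst."""
    n = len(pred_rank)
    return [
        rho - eps * pred_rank[i] if i == selected else 1 + eps * (n - pred_rank[i])
        for i in range(n)
    ]


n, rho, eps = 100, 0.1, 1e-6
pred_rank = list(range(1, n + 1))  # candidate at index i has predicted rank i + 1
selected = 0  # assumed policy: pick the candidate with predicted rank 1
utilities = adversarial_utilities(pred_rank, selected, rho, eps)

achieved_ratio = utilities[selected] / max(utilities)
concordance = c_index(pred_rank, utilities)
print(f"C-index = {concordance:.4f}, fraction of optimum = {achieved_ratio:.4f}")
```

With these values the C\-index equals $1 - 2/N$ while the selected candidate attains less than a $\rho$ fraction of the best available utility, matching the two quantities derived in the proof.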

We continue with the proof of [Proposition 2](https://arxiv.org/html/2606.02671#Thmproposition2)\.

See [Proposition 2](https://arxiv.org/html/2606.02671#Thmproposition2)\.

###### Proof\.

Consider any algorithm that assigns a probability $p_i \in [0,1]$ to each candidate $i \in \{1, \dots, N\}$ based on the utility determined by $\hat{f}$\. We assume $\sum_{i=1}^{N} p_i = 1$\. Let $\Delta$ be the total probability mass on the candidate with the highest probability\. Let $\pi(i) \in \{1, \dots, N\}$ denote the predicted rank of candidate $i$, where a lower $\pi(i)$ corresponds to a higher utility\.

We assign to the candidate with the highest probability a true, unknown utility of $\rho - \epsilon \cdot \pi(i)$ and a utility of $1 + \epsilon \cdot (N - \pi(i))$ to the remaining $N-1$ candidates for any $\rho \in (0,1)$ and $\epsilon > 0$\.

As seen in the proof of [Proposition 1](https://arxiv.org/html/2606.02671#Thmproposition1), if the predictor correctly ranks all the patients except the patient with the largest probability of selection, it achieves a C\-index at least $c$ for sufficiently large $N$ as

$$1 - \frac{N-1}{\binom{N}{2}} = 1 - \frac{2}{N}.$$

The expected utility of the algorithm is

$$\mathbb{E}[U(\mathsf{ALG})] = \sum_{i=1}^{N} p_i T_i \leq \Delta \cdot \rho + (1-\Delta) \cdot (1 + \epsilon N) = 1 - \Delta(1-\rho) + (1-\Delta)\epsilon N.$$

The optimal offline utility is $U(\mathrm{OPT}) \geq 1$\. The expected fraction of the optimal utility is then

$$\frac{\mathbb{E}[U(\mathsf{ALG})]}{\mathbb{E}[U(\mathrm{OPT})]} \leq \frac{1 - \Delta(1-\rho) + (1-\Delta)\epsilon N}{1} \leq 1 - \Delta(1-\rho)$$

as $\epsilon$ tends to $0$ for a fixed $N$\.

Since $\rho \in (0,1)$, this expected ratio is maximized when $\Delta$ is minimized\. The minimum value of the largest probability is achieved when the distribution is uniform\. Any non\-uniform policy necessarily has a larger $\Delta$ and can obtain a strictly lower expected ratio of the optimal utility as $N$ grows\. ∎
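To make the role of $\Delta$ concrete, the limiting bound $1 - \Delta(1-\rho)$ can be evaluated for a few selection distributions. This is an illustrative sketch only; the three distributions and the value of `rho` are assumptions chosen for the example.

```python
def ratio_upper_bound(probabilities, rho):
    """Limiting bound 1 - Delta * (1 - rho), where Delta is the largest probability."""
    delta = max(probabilities)
    return 1 - delta * (1 - rho)


n, rho = 10, 0.2
uniform = [1 / n] * n                       # Delta = 1 / n
skewed = [0.5] + [0.5 / (n - 1)] * (n - 1)  # Delta = 0.5
deterministic = [1.0] + [0.0] * (n - 1)     # Delta = 1

for name, probs in [("uniform", uniform), ("skewed", skewed), ("deterministic", deterministic)]:
    print(f"{name:>13}: bound = {ratio_upper_bound(probs, rho):.2f}")
```

The uniform distribution has the smallest $\Delta$ and therefore the largest bound, while the deterministic policy recovers the $\rho$ bound of the previous proposition.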

We move on to the proof of [Theorem 1](https://arxiv.org/html/2606.02671#Thmtheorem1), and provide the pseudocode for [Algorithm 1](https://arxiv.org/html/2606.02671#alg1)\. Recall that we assume non\-negative weights \($T_i \geq 0$ for all $i$\), as in our application patients cannot have passed away in the past\.

Algorithm 1: Position\-weighted allocation \($\mathsf{PWA}$\)

Input: $N$ candidates, predictor $\hat{f}$ with $\text{NDCG}@k \geq \alpha$\.

1. Order the $N$ candidates according to $\hat{f}$\.
2. Set $w_i = 1/\log_2(i+1)$ and $W_k = \sum_{i=1}^{k} w_i$\.
3. Randomly allocate to one of the top $k$ ranked candidates proportional to $w_i / W_k$\.
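The three steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; it assumes that a higher predicted score means a better candidate, and the function names and example scores are hypothetical.

```python
import math
import random


def pwa_probabilities(k):
    """Selection probabilities w_i / W_k for ranks 1..k, with w_i = 1 / log2(i + 1)."""
    weights = [1 / math.log2(i + 1) for i in range(1, k + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def position_weighted_allocation(scores, k, rng=random):
    """Return the index of the candidate chosen by position-weighted allocation."""
    # Step 1: order candidates by predicted score, best first (assumed convention).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = order[:k]
    # Steps 2-3: sample one of the top-k ranks with probability w_i / W_k.
    probs = pwa_probabilities(len(top_k))
    return rng.choices(top_k, weights=probs, k=1)[0]


scores = [0.3, 0.9, 0.1, 0.7]  # hypothetical predicted utilities
chosen = position_weighted_allocation(scores, k=2, rng=random.Random(0))
print("allocated to candidate", chosen)
```

With `k=1` the sampling step is degenerate and the sketch reduces to picking the top\-ranked candidate.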

See [Theorem 1](https://arxiv.org/html/2606.02671#Thmtheorem1)\.

###### Proof\.

Let $\pi(i) \in \{1, \dots, N\}$ denote the index of the candidate at rank $i$ ordered by $\hat{f}$\. We denote the true utility of candidate $i$ as $T_i$ and assume without loss of generality that $T_i \geq T_j$ for all $i \leq j$\. $\mathsf{PWA}$ selects rank $i \in [1,k]$ with probability $w_i / W_k$\. The expected utility of $\mathsf{PWA}$ is

$$\mathbb{E}[\mathsf{PWA}] = \sum_{i=1}^{k} \frac{w_i}{W_k} T_{\pi(i)} = \frac{\text{DCG}@k}{W_k}.$$

Since $\text{NDCG}@k \geq \alpha$, we have $\text{DCG}@k \geq \alpha\,\text{IDCG}@k$, so

$$\mathbb{E}[\mathsf{PWA}] \geq \frac{\alpha\,\text{IDCG}@k}{W_k}.$$

The worst case occurs when $\text{IDCG}@k$ is minimized\. Since $T_1$ is the utility of the best candidate \(and also the utility of $\mathrm{OPT}$\) and $w_1 = 1$,

$$\text{IDCG}@k = T_1 + \sum_{i=2}^{k} T_i w_i.$$

This quantity is minimized exactly when $T_i = 0$ for all $i \in [2,k]$ since the weights are non\-negative\. This completes the proof as

$$\mathbb{E}[\mathsf{PWA}] \geq \frac{\alpha T_1}{W_k} \geq \frac{\alpha}{W_k} U(\mathrm{OPT}).$$

∎
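As a sanity check on the chain of inequalities, the sketch below computes $\text{NDCG}@k$ for a toy instance and compares the expected utility of $\mathsf{PWA}$ with the lower bound $\frac{\alpha}{W_k} U(\mathrm{OPT})$. The utilities and scores are made\-up values for illustration, and the helper names are not from the paper.

```python
import math


def dcg_at_k(utilities_in_rank_order, k):
    """Sum of T / log2(rank + 1) over the first k ranks."""
    return sum(
        t / math.log2(rank + 1)
        for rank, t in enumerate(utilities_in_rank_order[:k], start=1)
    )


def ndcg_at_k(scores, utilities, k):
    """DCG of the predicted ordering divided by the DCG of the ideal ordering."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = dcg_at_k([utilities[i] for i in order], k)
    idcg = dcg_at_k(sorted(utilities, reverse=True), k)
    return dcg / idcg


def expected_pwa_utility(scores, utilities, k):
    """E[PWA] = DCG@k / W_k, as in the first display of the proof."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    w_k = sum(1 / math.log2(i + 1) for i in range(1, k + 1))
    return dcg_at_k([utilities[i] for i in order], k) / w_k


utilities = [5.0, 1.0, 3.0, 0.5, 2.0]  # hypothetical true utilities T_i
scores = [0.2, 0.9, 0.8, 0.1, 0.4]     # hypothetical (imperfect) predictions
k = 3

alpha = ndcg_at_k(scores, utilities, k)
w_k = sum(1 / math.log2(i + 1) for i in range(1, k + 1))
lower_bound = alpha / w_k * max(utilities)  # (alpha / W_k) * U(OPT)
expected = expected_pwa_utility(scores, utilities, k)
print(f"E[PWA] = {expected:.3f} >= bound = {lower_bound:.3f}")
```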

The proof of [Corollary 1](https://arxiv.org/html/2606.02671#Thmcorollary1) follows immediately since $W_1 = 1$\. Next, we prove the unbiasedness of our NDCG estimators\.

See [Property 1](https://arxiv.org/html/2606.02671#Thmproperty1)\.

###### Proof\.

We can write the expected value of $\hat{T}^{\text{EY}}_i$ based on the event indicator $\delta_i$,

$$\mathbb{E}[\hat{T}^{\text{EY}}_i \mid X_i] = \mathbb{E}[\delta_i T_i \mid X_i] + \mathbb{E}\left[(1-\delta_i)\,\mathbb{E}_{\hat{S}}[T^{*}_i \mid T^{*}_i > T_i, X_i] \mid X_i\right].$$

We consider the two terms and values of $\delta_i$\.

1. $\delta_i = 1$: The event is fully observed, so $T_i = T^{*}_i$, which implies $\mathbb{E}[\delta_i T_i \mid X_i] = \mathbb{E}[\delta_i T^{*}_i \mid X_i]$\.
2. $\delta_i = 0$: The observation is censored, so $T^{*}_i > T_i$\. Under Assumption [1](https://arxiv.org/html/2606.02671#Thmassumption1), $\delta_i = 0$ does not influence the survival time $T^{*}_i$ other than the survival up to the censoring time\. It follows from Assumption [2](https://arxiv.org/html/2606.02671#Thmassumption2) and the Law of Iterated Expectation that $\mathbb{E}\left[(1-\delta_i)\,\mathbb{E}_{\hat{S}}[T^{*}_i \mid T^{*}_i > T_i, X_i] \mid X_i\right] = \mathbb{E}[(1-\delta_i) T^{*}_i \mid X_i]$\.

Combining both cases,

$$\mathbb{E}[\delta_i T^{*}_i \mid X_i] + \mathbb{E}[(1-\delta_i) T^{*}_i \mid X_i] = \mathbb{E}\left[(\delta_i + 1 - \delta_i) T^{*}_i \mid X_i\right] = \mathbb{E}[T^{*}_i \mid X_i].$$

∎
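For intuition, the EY relevance can be computed from a discretized survival curve using the standard identity $\mathbb{E}[T^{*} \mid T^{*} > c] = c + \frac{1}{S(c)} \int_c^{\infty} S(u)\,du$. The sketch below is an assumption\-laden illustration: it treats $\hat{S}$ as a step function on a finite grid and truncates the integral at the last grid point, and all names are hypothetical.

```python
def conditional_expected_time(grid, surv, c):
    """E[T* | T* > c] for a step survival curve.

    grid: increasing time points t_0 < t_1 < ... < t_m.
    surv: m values, surv[j] being S(t) on the interval [t_j, t_{j+1}).
    The curve is truncated at grid[-1] (assumption of this sketch).
    """
    s_c = None
    area = 0.0
    for left, right, s in zip(grid[:-1], grid[1:], surv):
        if right <= c:
            continue  # interval lies entirely before the censoring time
        if s_c is None:
            s_c = s  # survival level on the interval containing c
        area += s * (right - max(left, c))
    if not s_c:
        return c  # no remaining survival mass beyond c
    return c + area / s_c


def ey_relevance(observed_time, event, grid, surv):
    """Observed time if the event was seen, otherwise the imputed conditional mean."""
    if event:
        return observed_time
    return conditional_expected_time(grid, surv, observed_time)


grid = [0.0, 1.0, 2.0, 3.0]
surv = [1.0, 0.5, 0.25]  # hypothetical patient-specific survival curve
print(ey_relevance(1.2, True, grid, surv))   # uncensored: returns the observed time
print(ey_relevance(1.0, False, grid, surv))  # censored at t = 1: imputed mean
```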

We continue with the IPCW estimator\.

See [Property 2](https://arxiv.org/html/2606.02671#Thmproperty2)\.

###### Proof\.

Note that $\delta_i = \mathbb{I}\{C_i \geq T^{*}_i\}$ and $T_i = T^{*}_i$ when $\delta_i = 1$\. Since the numerator evaluates to zero when $\delta_i = 0$, we can substitute $T^{*}_i$ for $T_i$ inside the expectation without changing the expected value\. Under Assumption [1](https://arxiv.org/html/2606.02671#Thmassumption1) and the Law of Iterated Expectation, we get

$$\mathbb{E}_{T^{*},C,X}\left[\frac{\mathbb{I}\{C_i \geq T^{*}_i\}\,T^{*}_i}{\hat{G}(T^{*}_i \mid X_i)} \,\Bigg|\, X_i\right] = \mathbb{E}_{T^{*},X}\left[\mathbb{E}_{C}\left[\frac{\mathbb{I}\{C_i \geq T^{*}_i\}\,T^{*}_i}{\hat{G}(T^{*}_i \mid X_i)} \,\Bigg|\, T^{*}_i, X_i\right] \,\Bigg|\, X_i\right].$$

Since $T^{*}_i$ is constant relative to the inner expectation over $C$, we factor it out, giving

$$\mathbb{E}_{T^{*},X}\left[T^{*}_i \cdot \mathbb{E}_{C}\left[\frac{\mathbb{I}\{C_i \geq T^{*}_i\}}{\hat{G}(T^{*}_i \mid X_i)} \,\Bigg|\, T^{*}_i, X_i\right] \,\Bigg|\, X_i\right].$$

By Assumption [3](https://arxiv.org/html/2606.02671#Thmassumption3), the inner expectation equals $1$\. Therefore, $\mathbb{E}_{T^{*},X}[T^{*}_i \cdot 1 \mid X_i] = \mathbb{E}[T^{*}_i \mid X_i]$\.

∎
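The argument can be verified exactly on a small discrete example. The sketch below assumes independent censoring and that the censoring survival function $G(t) = \Pr[C \geq t]$ is known exactly; the two distributions are invented for illustration.

```python
def ipcw_relevance(observed_time, event, censor_survival):
    """delta_i * T_i / G_hat(T_i): zero for censored points, up-weighted otherwise."""
    if not event:
        return 0.0
    g = censor_survival(observed_time)
    if g <= 0:
        raise ValueError("censoring survival must be positive at observed event times")
    return observed_time / g


# Hypothetical discrete distributions of the true time T* and the censoring time C.
t_dist = {1.0: 0.5, 2.0: 0.3, 3.0: 0.2}
c_dist = {1.5: 0.2, 2.5: 0.3, 4.0: 0.5}


def censor_survival(t):
    """G(t) = P(C >= t), assumed to be known exactly in this sketch."""
    return sum(p for c, p in c_dist.items() if c >= t)


# Exact expectation of the IPCW relevance by enumerating all (T*, C) pairs.
ipcw_mean = sum(
    p_t * p_c * ipcw_relevance(min(t, c), c >= t, censor_survival)
    for t, p_t in t_dist.items()
    for c, p_c in c_dist.items()
)
true_mean = sum(t * p for t, p in t_dist.items())
print(f"IPCW mean = {ipcw_mean:.4f}, true mean = {true_mean:.4f}")
```

The two printed means coincide up to floating\-point error because each event time $t$ is observed with probability $G(t)$ and then up\-weighted by $1/G(t)$.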

## Appendix C Further discussion of NDCG estimators

The restricted EY estimator is $\hat{T}^{(\tau),\text{EY}}_i = \delta^{(\tau)}_i T^{(\tau)}_i + \left(1 - \delta^{(\tau)}_i\right) \mathbb{E}_{\hat{S}}\left[T^{(\tau),*}_i \mid T^{*}_i > T_i, X_i\right]$\.

If we again assume the conditional survival function is conditionally unbiased, we obtain that the restricted estimator provides conditionally unbiased estimates of the relevance\.

###### Assumption 4 \(Restricted unbiasedness of $\hat{S}$\)\.

The conditional survival function $\hat{S}$ is conditionally unbiased such that the expected value of the restricted estimated survival times matches the true restricted survival time\. That is, $\mathbb{E}\left[\mathbb{E}_{\hat{S}}[T^{(\tau),*}_i \mid T^{*}_i > T_i, X_i]\right] = \mathbb{E}[T^{(\tau),*}_i \mid T^{*}_i > T_i, X_i]$\.

###### Property 3 \(Unbiasedness of restricted EY estimator\)\.

Under Assumptions [1](https://arxiv.org/html/2606.02671#Thmassumption1) and [4](https://arxiv.org/html/2606.02671#Thmassumption4), $\hat{T}^{(\tau),\text{EY}}_i$ is a conditionally unbiased estimator of $\mathbb{E}\left[T^{(\tau),*}_i \mid X_i\right]$\.

An identical argument to the proof of [Property 1](https://arxiv.org/html/2606.02671#Thmproperty1) holds for the proof of [Property 3](https://arxiv.org/html/2606.02671#Thmproperty3)\.

The restricted IPCW estimator, $\hat{T}^{(\tau),\text{IPCW}}_i = \frac{\delta^{(\tau)}_i T^{(\tau)}_i}{\hat{G}(T^{(\tau)}_i \mid X_i)}$, is also unbiased by an identical argument to the unrestricted one\.

###### Property 4 \(Unbiasedness of restricted IPCW estimator\)\.

Under Assumptions [1](https://arxiv.org/html/2606.02671#Thmassumption1) and [3](https://arxiv.org/html/2606.02671#Thmassumption3), $\hat{T}^{(\tau),\text{IPCW}}_i$ is a conditionally unbiased estimator of $\mathbb{E}\left[T^{(\tau),*}_i \mid X_i\right]$\.

As with the restricted case for the EY estimator, an identical argument holds for the proof of [Property 4](https://arxiv.org/html/2606.02671#Thmproperty4)\.

Since $\hat{G}(t \mid X) \to 0$ in the tail, IPCW weights can grow arbitrarily large\. Restricting the horizon $\tau$ bounds the maximum weight at $1/\hat{G}(\tau \mid X)$ and is a practical solution for stability\. Another advantage of the restricted case of IPCW is that we only require calibration over the horizon $\tau$\. In Assumption [3](https://arxiv.org/html/2606.02671#Thmassumption3), we also require that $\hat{G}(t \mid X) > 0$ to avoid division by $0$\. In practice, this is not guaranteed in the unrestricted setting, but can be guaranteed in the restricted case\.

IPCW has a number of drawbacks\. It assigns a weight of zero to any data point censored before $\tau$, discarding a significant portion of the dataset\. In clinical settings with heavy censoring, this reduces the sample size of the evaluation metric\. Although the horizon $\tau$ theoretically bounds the maximum weight, patients with a high probability of being censored still receive large inverse weights\. This introduces high variance into the evaluation metric, meaning a single heavily weighted patient can entirely dominate the DCG score\.
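The effect of the horizon on the weights can be illustrated with a short sketch. It assumes the usual restricted\-time convention, namely $T^{(\tau)}_i = \min(T_i, \tau)$ and $\delta^{(\tau)}_i = 1$ when the event is observed or the patient is still followed at $\tau$; this convention and the exponential censoring model are assumptions of the example, not details confirmed by this appendix.

```python
import math


def restricted_ipcw_relevance(observed_time, event, tau, censor_survival):
    """Restricted IPCW relevance under the assumed convention described above."""
    t_tau = min(observed_time, tau)
    # Assumed: a point counts as observed if the event occurred or follow-up reached tau.
    delta_tau = bool(event) or observed_time >= tau
    if not delta_tau:
        return 0.0  # censored before the horizon: weight zero
    return t_tau / censor_survival(t_tau)


def censor_survival(t, rate=0.5):
    """Hypothetical exponential censoring model G(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


tau = 5.0
max_weight = 1 / censor_survival(tau)  # the bound 1 / G(tau) on any weight

early_censored = restricted_ipcw_relevance(1.0, False, tau, censor_survival)
late_censored = restricted_ipcw_relevance(7.0, False, tau, censor_survival)
event_seen = restricted_ipcw_relevance(2.0, True, tau, censor_survival)
print(early_censored, event_seen, late_censored, max_weight)
```

A patient censored before $\tau$ contributes nothing, while one followed past $\tau$ contributes $\tau$ times the largest admissible weight, which shows both the discarded data and the variance concern described above.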

## Appendix D Further details of baseline survival predictors

We present each baseline survival predictor in more detail\.

- Kaplan\-Meier estimator \(KM\) \[Kaplan and Meier, [1958](https://arxiv.org/html/2606.02671#bib.bib100)\] is a non\-parametric model that predicts the survival function directly from empirical data\. It computes the survival probability at time $t$ as the product of conditional survival probabilities at all observed event times $t_i \leq t$, defined as $\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$, where $d_i$ is the number of events occurring at time $t_i$ and $n_i$ is the number of patients at risk just prior to time $t_i$\. The KM estimator does not use any covariates in its prediction and therefore does not explicitly satisfy the assumptions required to achieve an unbiased estimate of relevance\. We still include it as a baseline as it is commonly used in the clinical setting\.
- Cox regression \(Cox\) \[Cox, [1972](https://arxiv.org/html/2606.02671#bib.bib23)\] uses a semi\-parametric model relying on the proportional hazards assumption, which states that the hazard function can be expressed as $h(t \mid x) = h_0(t) \exp(\theta^{\top} x)$, where $h_0(t)$ is the baseline hazard function\. The baseline hazard is typically fit using the Breslow estimator \[Breslow, [1975](https://arxiv.org/html/2606.02671#bib.bib26)\]\. The survival function $S(t \mid x)$ is computed as $S(t \mid x) = \exp\left(-\exp(\theta^{\top} x) \int_{0}^{t} h_0(z)\,dz\right)$\.
- Accelerated Failure Time \(AFT\) \[Wei, [1992](https://arxiv.org/html/2606.02671#bib.bib101)\] is a fully parametric model that assumes the effect of covariates is to accelerate \(or decelerate\) the time to an event\. It models the logarithm of the survival time $T$ as a linear function of the covariates $x$, $\log(T) = \theta^{\top} x + \sigma \epsilon$, where $\theta$ is the vector of coefficients, $\sigma$ is a scale parameter, and $\epsilon$ is an error term following an underlying assumed distribution \(typically chosen so that $T$ follows a Weibull distribution\)\. The survival function $S(t \mid x)$ is related to the baseline survival function $S_0(t)$ by $S(t \mid x) = S_0(t \exp(-\theta^{\top} x))$\.
- DeepSurv \[Katzman et al\., [2018](https://arxiv.org/html/2606.02671#bib.bib102)\] is a non\-linear extension of Cox regression that uses deep neural networks to model the proportional hazards\. DeepSurv replaces the linear combination of covariates with the scalar output of a deep neural network, $f_{\theta}(x)$\. The hazard function is modeled as $h(t \mid x) = h_0(t) \exp(f_{\theta}(x))$\.
- DeepHit \[Lee et al\., [2018](https://arxiv.org/html/2606.02671#bib.bib28)\] is a discrete\-time survival model that relaxes the proportional hazards assumption\. DeepHit discretizes the time horizon into intervals and uses a deep neural network to estimate the probability of the event occurring within each interval\.
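As a concrete reference for the first baseline, the Kaplan\-Meier product formula can be written directly in plain Python. This is a minimal sketch for illustration; the experiments rely on library implementations, and the toy follow\-up times below are invented.

```python
def kaplan_meier(times, events):
    """Return [(t_i, S_hat(t_i))] at each distinct time with at least one event.

    times: observed follow-up times; events: 1 if the event occurred, 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / n_at_risk  # factor (1 - d_i / n_i)
            curve.append((t, surv))
        n_at_risk -= removed  # events and censored subjects both leave the risk set
    return curve


times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(f"S({t}) = {s:.2f}")
```

Censored subjects never trigger a drop in the curve, but they shrink the risk set $n_i$ for later event times, which is how the estimator accounts for right censorship.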

## Appendix E Model and training configurations

We detail the hyperparameters and architectural configurations for all models used in our experiments\. All models are implemented in Python 3\.11, leveraging common implementations from the scikit\-survival, lifelines, and pycox libraries\. We preprocess the dataset in the same way for all models\. We impute missing features using median imputation, scale features using a standard scaler, and leverage a one\-hot encoding of categorical features\. All experiments are conducted on an M4 Pro processor with 24GB unified memory, and terminate within hours on this processor\. We report the final hyperparameters used in experiments following tuning\.
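The three preprocessing steps can be sketched with the standard library alone. This is only an illustration of the operations named above, not the pipeline used in the experiments; the helper names and toy columns are hypothetical, and in practice the statistics would be fit on the training fold only.

```python
import statistics


def impute_and_scale(column):
    """Median-impute missing (None) entries, then standardize the column."""
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in column]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled) or 1.0  # guard against constant columns
    return [(v - mean) / std for v in filled]


def one_hot(column):
    """One indicator column per distinct category, in sorted category order."""
    categories = sorted(set(column))
    return [[1 if v == c else 0 for c in categories] for v in column]


age = [1.0, None, 3.0]          # hypothetical numeric feature with a missing value
blood_type = ["A", "B", "A"]    # hypothetical categorical feature
print(impute_and_scale(age))
print(one_hot(blood_type))
```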

### E\.1 Our bootstrapping approach

Our approach utilizes a Gradient Boosted Decision Tree \(GBDT\) implemented via the LightGBM library, optimized with a custom objective detailed in [Section 5](https://arxiv.org/html/2606.02671#S5)\. We present the necessary hyperparameters for reproducibility in [Table A1](https://arxiv.org/html/2606.02671#A5.T1)\. Note that training also uses early stopping on a validation set\.

Table A1: Bootstrapped model hyperparameters\.

### E\.2 Deep neural networks

DeepSurv and DeepHit are implemented using the pycox library, configured using the values in [Table A2](https://arxiv.org/html/2606.02671#A5.T2)\. Training also uses early stopping based on the performance on the validation set\.

Table A2: DeepSurv and DeepHit configurations\.

### E\.3 Statistical baselines

The statistical baselines are implemented from standard libraries with $L_2$ regularization to prevent overfitting; their parameters are shown in [Table A3](https://arxiv.org/html/2606.02671#A5.T3)\.

Table A3: Statistical model parameters\.

## Appendix F Further details of the experimental setup

In the artificial censoring experiment, for patients selected for censoring, we draw a censoring time $C_i \in [0, T^{*}_i]$ uniformly at random\. We conduct 5 iterations of 5\-fold cross\-validation\. In each iteration, the dataset is independently re\-censored\. We cross\-fit the nuisance models within each test fold to avoid overfitting to the observed outcomes and to ensure that the nuisance estimates used for evaluation are generated independently of the data used to train the predictors\.
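The re\-censoring step can be sketched as follows. How patients are selected for censoring is not specified in this appendix, so the sketch assumes independent selection with a fixed probability `censor_fraction`; that parameter, the seed handling, and the function name are all hypothetical.

```python
import random


def artificially_censor(true_times, censor_fraction, seed=0):
    """Return (observed_times, events) after artificial right censoring.

    Selected patients get a censoring time C_i drawn uniformly from [0, T_i*].
    Selection by an independent coin flip is an assumption of this sketch.
    """
    rng = random.Random(seed)
    observed, events = [], []
    for t in true_times:
        if rng.random() < censor_fraction:
            observed.append(rng.uniform(0.0, t))  # C_i ~ Uniform[0, T_i*]
            events.append(0)
        else:
            observed.append(t)
            events.append(1)
    return observed, events


true_times = [2.5, 7.0, 1.2, 10.4, 4.8]  # hypothetical fully observed survival times
# A different seed per cross-validation iteration re-censors the data independently.
for iteration in range(5):
    observed, events = artificially_censor(true_times, censor_fraction=0.4, seed=iteration)
    print(iteration, events)
```

Because the ground\-truth times are retained, the true NDCG can be computed alongside the estimated one for every re\-censored copy.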

For the full experiment on the UNOS registry, we again conduct 5 iterations of 5\-fold cross\-validation using the same cross\-fitting setup\.

## Appendix G Omitted tables and figures

Table A4: Summary of UNOS dataset characteristics and feature composition\.

Table A5: Distribution of follow\-up times in years across the overall cohort, observed events, and censored patients\.

![Refer to caption](https://arxiv.org/html/2606.02671v1/x4.png)

Figure A1: Mean absolute error in NDCG estimation\.

![Refer to caption](https://arxiv.org/html/2606.02671v1/x5.png) \(a\) Estimation of NDCG@1\.

![Refer to caption](https://arxiv.org/html/2606.02671v1/x6.png) \(b\) Estimation of NDCG@5\.

![Refer to caption](https://arxiv.org/html/2606.02671v1/x7.png) \(c\) Estimation of NDCG@10\.

![Refer to caption](https://arxiv.org/html/2606.02671v1/x8.png) \(d\) Estimation of NDCG@50\.

![Refer to caption](https://arxiv.org/html/2606.02671v1/x9.png) \(e\) Estimation of NDCG@100\.

![Refer to caption](https://arxiv.org/html/2606.02671v1/x10.png) \(f\) Estimation of NDCG\.

Figure A2: Ground\-truth versus estimated NDCG@$k$\. Each data point is an individual estimate for a model and estimator\. We report the Spearman rank correlation coefficient, $\rho$, for each estimator\.

Table A6: Model pairwise accuracy by estimator\.

Table A7: Performance of models from the EY NDCG estimator\. We report the average over all nuisance models for each estimator\.

Table A8: Performance of models from the IPCW NDCG estimator\. We report the average over all nuisance models for each estimator\.
