A Transferable Learned Temporal Prior for Transmission Reconstruction and Decision-Relevant Uncertainty in Real Outbreak Labels
Summary
This paper presents a transferable learned temporal prior for outbreak transmission reconstruction, demonstrating improved performance on a real Andes virus benchmark and highlighting the importance of quantifying uncertainty in transmission labels.
# A Transferable Learned Temporal Prior for Transmission Reconstruction and Decision-Relevant Uncertainty in Real Outbreak Labels
Source: [https://arxiv.org/html/2606.30842](https://arxiv.org/html/2606.30842)
[Md Ahsan Karim](https://orcid.org/0009-0001-5024-8448), Department of Computer Science and Engineering, National Institute of Textile Engineering and Research \(NITER\), Nayarhat, Savar, Dhaka\-1350, Bangladesh; makarim11@niter\.edu\.bd
###### Abstract
Outbreak transmission reconstruction treats epidemiological timing and transmission labels as deterministic ground truth; neither has been systematically evaluated\. We trained a logistic regression temporal prior on eleven disease families, locked all parameters before accessing any target outbreak data, and applied it without refitting to a strict Andes virus \(ANDV\) parent\-ranking benchmark of 29 tasks\. The locked prior achieved mean reciprocal rank \(MRR\) 0\.571 versus 0\.274 and Top\-1 accuracy 37\.9% versus 13\.8% against the best source\-trained parametric baseline \(permutation $p\leq 0.0002$; 7–8 reversals to lose MRR significance\)\. A phylogenetic concordance audit of 75 NYC mpox inter\-host pairs, which serves as independent label\-reliability evidence rather than as validation of the prior, found that 54\.67% \(exact 95% CI: 42\.75–66\.21%\) were genomically unresolved or unsupported\. Retaining uncertain edges in ANDV and Guangdong Delta graphs shifted top\-5 source\-priority sets \(Jaccard 0\.429–0\.667\)\. Transmission\-label uncertainty was measurable in the outbreak evidence modules examined, and retaining uncertain links changed which source cases were prioritized for intervention\.
*Keywords:* Transmission Reconstruction; Temporal Priors; Uncertainty Quantification; Outbreak Epidemiology; Parent\-Ranking Benchmark; Zero\-Shot Transfer
## 1 Introduction
Reconstructing who infected whom ranks among the most operationally urgent tasks in outbreak response\. When a new case is confirmed, contact tracers must evaluate all prior cases and prioritize the most probable source: typically within hours, before genomic sequencing is complete, before contact networks are compiled, and before the outbreak’s serial\-interval distribution can be estimated\. The challenge is not hypothetical\. Person\-to\-person Andes virus transmission across four generations in a rural Argentine community, a Sudan virus disease outbreak in Uganda with documented epidemiological transmission evidence, and an accelerating mpox epidemic in a densely connected urban population each demanded source attribution under exactly these conditions\[[1](https://arxiv.org/html/2606.30842#bib.bib11),[15](https://arxiv.org/html/2606.30842#bib.bib29),[17](https://arxiv.org/html/2606.30842#bib.bib18)\]\. In each case the fundamental question was identical: given immediately observable information \(symptom onset dates, documented contacts, and inferred exposure windows\), which cases should be prioritized?
Existing methods address this problem under two distinct evidence regimes\. Genome\-integrated frameworks such as outbreaker2\[[2](https://arxiv.org/html/2606.30842#bib.bib16)\], SCOTTI\[[7](https://arxiv.org/html/2606.30842#bib.bib8)\], and epi\-genomic integration approaches\[[9](https://arxiv.org/html/2606.30842#bib.bib14),[8](https://arxiv.org/html/2606.30842#bib.bib15)\] reconstruct transmission trees by jointly modeling sequence evolution, phylogenetic uncertainty, and epidemiological timing\. These methods are powerful when multiple high\-quality pathogen sequences are available per case and within\-host diversity provides discriminative signal\. Parametric timing methods offer a complementary approach: they fit Gaussian, Gamma, or Lognormal distributions to historical serial\-interval data and score candidate infectors from the resulting likelihoods\. Both method families share a limitation that has received insufficient attention: they assume the epidemiological labels used for training and evaluation constitute clean, complete, deterministic ground truth\. A systematic audit of 134,095 records from the Global\.Health public outbreak repository, conducted as part of this study, recovered only 26 transmission edges and zero usable parent\-ranking tasks under strict benchmark construction criteria\. Public outbreak data are rarely structured to support rigorous reconstruction benchmarking, and the labels that do exist carry unquantified uncertainty\[[14](https://arxiv.org/html/2606.30842#bib.bib10),[11](https://arxiv.org/html/2606.30842#bib.bib13)\]\.
This study addresses both limitations\. We learn a transferable temporal transmission prior from a multi\-disease benchmark spanning eleven disease families, then lock all parameters before any target outbreak data are accessed\. We validate this locked prior on a strict parent\-ranking benchmark constructed from real Andes virus person\-to\-person transmission data, the densest source of documented directional ANDV transmission in the published literature\. The locked prior substantially outperforms all four fair source\-trained parametric temporal baselines without target\-specific refitting\. A pilot evaluation on a reconstructed Sudan virus disease \(SVD\) transmission network confirmed that temporal gap proximity carries discriminative ranking signal under relative\-time benchmark conditions; validating the locked prior on SVD requires a dedicated absolute\-onset benchmark, which this study does not provide\. We audit a published epi\-genomic resource for the 2022 New York City mpox outbreak to test whether epidemiological transmission labels can be treated as ground truth\. The majority of inter\-host linked pairs prove either genomically unresolved or unsupported as direct transmission events\. Using a densely traced Guangdong Delta outbreak transmission graph, we show that retaining edge uncertainty changes inferred source burden, alters outbreak concentration, and shifts fixed\-capacity source prioritization decisions\.
The findings reported here do not imply that all outbreak datasets exhibit the same degree of transmission\-label uncertainty\. Instead, they show that uncertainty can arise at multiple evidentiary levels when transmission links are inferred from exposure proximity, contact interviews, phylogenetic proximity, or graph reconstruction rather than directly observed transmission events\. Across the datasets examined in this study, this uncertainty was measurable and its retention changed inferred source burden, top\-source composition, and fixed\-capacity prioritization decisions\. The central argument is therefore not that uncertainty prevents inference, nor that its prevalence is identical across pathogen families, but that deterministic treatment of uncertain transmission labels can change the conclusions drawn from outbreak reconstruction\.
## 2 Related Work
Transmission reconstruction methods have evolved along two largely parallel tracks: those that exploit genomic sequence data to infer transmission topology, and those that rely exclusively on epidemiological timing to score candidate infectors\. The present work occupies a specific position relative to both tracks and introduces a third dimension that neither track has systematically addressed: the reliability of the transmission labels on which all methods, regardless of modality, ultimately depend\.
##### Parametric and nonparametric timing methods\.
The foundational framework for timing\-based source attribution was established by Wallinga and Teunis \[[24](https://arxiv.org/html/2606.30842#bib.bib1)\], who derived infector assignment probabilities from case onset times and a known serial interval distribution\. This framework demonstrated that temporal evidence alone carries substantial discriminative signal and motivated subsequent work on generation interval estimation\[[23](https://arxiv.org/html/2606.30842#bib.bib2),[10](https://arxiv.org/html/2606.30842#bib.bib17)\] and time\-varying reproduction number inference\[[5](https://arxiv.org/html/2606.30842#bib.bib3)\]\. Disease\-specific serial interval characterization has been pursued across a range of pathogens; the 14\-year Nipah virus surveillance program in Bangladesh\[[18](https://arxiv.org/html/2606.30842#bib.bib32)\] illustrates the sustained epidemiological effort required to produce reliable interval estimates for a single pathogen family\. Parametric approaches fit a distributional family, typically Gaussian, Gamma, or Lognormal, to observed source interval data and score candidate pairs by the resulting likelihood\. Their principal strength is interpretability; their structural limitation is that family misspecification introduces systematic bias whenever the target outbreak’s gap distribution departs from the assumed form\. Nonparametric alternatives such as kernel density estimators relax the shape assumption but remain sensitive to bandwidth choice and extrapolate poorly to gap values underrepresented in the source sample\. Neither class learns a transferable representation that can be locked before target data are accessed; both require distributional fitting to the target outbreak directly or to a presumed universal reference\.
The present study replaces this assumption with a discriminatively trained logistic model whose parameters are fixed from source data alone and verified to reproduce identical outputs on any target benchmark\.
##### Genome\-integrated transmission reconstruction\.
A second method family integrates pathogen sequence data with epidemiological evidence to reconstruct transmission trees or posterior distributions over trees\. Didelot et al\.\[[9](https://arxiv.org/html/2606.30842#bib.bib14)\] formalized Bayesian inference of infectious disease transmission from whole\-genome sequence data; subsequent extensions addressed partially sampled and ongoing outbreaks in which unsequenced or unsampled intermediate hosts create observational gaps\[[8](https://arxiv.org/html/2606.30842#bib.bib15)\]\. The outbreaker2 platform\[[2](https://arxiv.org/html/2606.30842#bib.bib16)\] provides a modular framework for joint epidemiological and evolutionary inference, enabling flexible model specification across outbreak types\. The structured coalescent model in SCOTTI\[[7](https://arxiv.org/html/2606.30842#bib.bib8)\] treats each host as a structured subpopulation, making within\-host diversity and transmission explicit inference components\. Variant\-aware approaches\[[6](https://arxiv.org/html/2606.30842#bib.bib12)\] and epi\-genomic integration frameworks\[[13](https://arxiv.org/html/2606.30842#bib.bib9),[3](https://arxiv.org/html/2606.30842#bib.bib31)\] extend these ideas by exploiting intrahost variant frequency data and pairwise genomic distances to sharpen transmission assignments\. ScITree\[[22](https://arxiv.org/html/2606.30842#bib.bib33)\] addresses scalability directly: it provides a Bayesian framework for joint inference from epidemiological and genomic data that is tractable for large outbreak datasets\. Recent preprint frameworks, including JUNIPER\[[19](https://arxiv.org/html/2606.30842#bib.bib27)\] and BREATH\[[4](https://arxiv.org/html/2606.30842#bib.bib28)\], illustrate emerging directions in scalable or joint phylodynamic–epidemiological reconstruction; they are cited only as adjacent preprint frameworks, not as peer\-reviewed evidence or numerical comparators in this study\.
These methods are powerful when per\-case whole\-genome sequences are available with sufficient within\-host diversity to resolve transmission at the individual\-case level\. They are not designed for the timing\-only regime addressed here, in which genomic data are absent, delayed, or insufficient for direct transmission inference; direct numerical comparison would therefore be methodologically inappropriate\. The present study positions itself as complementary to genome\-integrated reconstruction, not competitive with it\.
##### Transmission benchmark construction and label reliability\.
Comparative evaluation of reconstruction methods requires ground truth labels linking each case to its true infector\. In practice, such labels derive from contact investigations, phylogenetic proximity, or outbreak investigation reports; none constitutes direct observation of a transmission event\. The OutbreakTrees resource\[[20](https://arxiv.org/html/2606.30842#bib.bib20)\] curates a multi\-disease collection of published transmission trees that has enabled cross\-method and cross\-disease comparison, but label quality in such resources is rarely formally characterized\. King et al\.\[[14](https://arxiv.org/html/2606.30842#bib.bib10)\] documented how errors in outbreak data structure propagate into reconstruction conclusions, motivating explicit provenance auditing before benchmark construction\. The present study extends this concern in two directions\. First, a systematic audit of 134,095 records from a large public outbreak repository recovered only 26 transmission edges and no usable parent\-ranking tasks under strict construction criteria, illustrating the structural scarcity of benchmark\-ready data\. Second, a formal epi\-genomic concordance audit of epidemiologically linked mpox pairs\[[1](https://arxiv.org/html/2606.30842#bib.bib11)\] demonstrates that label uncertainty is not merely a data\-access problem but a fundamental property of how transmission events are recorded\. No prior work, to our knowledge, has quantified the downstream consequence of this uncertainty for specific public\-health prioritization decisions: specifically, whether choosing between strict and uncertainty\-aware graph construction changes which source cases are selected under fixed\-capacity response scenarios\.
##### Robustness diagnostics and uncertainty\-aware decision analysis\.
Statistical conclusions drawn from compact outbreak benchmarks are vulnerable to the influence of individual cases\. Robustness diagnostics for binary trial endpoints, developed under the fragility index framework\[[25](https://arxiv.org/html/2606.30842#bib.bib6)\], quantify how many outcome reversals would overturn a significance conclusion\. The present study adapts this concept to the paired ranking setting, computing the minimum number of task\-level reversals required to lose sign\-test significance for each method comparison\. This paired reversal index differs from the classical fragility index, which targets dichotomous trial outcomes, but shares its interpretive advantage: robustness expressed in units directly meaningful to the evaluation design\. Alongside resampling\-based inference\[[26](https://arxiv.org/html/2606.30842#bib.bib7)\] and leave\-one\-out influence diagnostics, this framework provides multidimensional robustness characterization suited to compact real\-outbreak benchmarks\. At the decision level, Hadjisotiriou et al\.\[[11](https://arxiv.org/html/2606.30842#bib.bib13)\] studied the consequences of graph\-level uncertainty for policy prioritization under deep uncertainty frameworks, arguing for stress\-testing conclusions across plausible alternative scenarios rather than optimizing against a single point forecast\. The present study operationalizes this principle in a concrete outbreak context by comparing strict and uncertainty\-expanded transmission graphs and measuring priority\-set instability under fixed response capacity\. The decision curve analysis framework\[[21](https://arxiv.org/html/2606.30842#bib.bib5)\] provides additional conceptual grounding for evaluating the operational consequences of threshold\-based classification decisions under uncertainty\.
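The paired reversal idea can be illustrated with a short sketch. It assumes a two-sided exact sign test with ties dropped and a significance level of 0.05; the study's exact test configuration is not given in this section, so these choices, the function names, and the win/loss counts in the usage line are all illustrative assumptions rather than the authors' procedure.

```python
from math import comb

def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided exact sign-test p-value (ties assumed to be dropped beforehand)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least as lopsided as k under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def paired_reversal_index(wins: int, losses: int, alpha: float = 0.05) -> int:
    """Smallest number of win-to-loss flips after which p >= alpha.

    Returns 0 when the comparison is already non-significant.
    """
    flips = 0
    while wins > 0 and sign_test_p(wins, losses) < alpha:
        wins -= 1
        losses += 1
        flips += 1
    return flips

# Hypothetical comparison: 20 task-level wins, 5 losses (not study data).
print(paired_reversal_index(20, 5))  # 3
```

Expressing robustness as a count of task-level flips keeps the diagnostic in the same units as the benchmark itself, which is the interpretive advantage described above.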
Table 1 summarizes representative methods across these four dimensions, with particular attention to input data regime, inference target, and key assumptions differentiating each approach from the present study\.
Table 1: Comparison of representative transmission\-reconstruction and uncertainty\-analysis methods\.

| Method | Input regime | Target and assumptions | Main strengths | Main limitations / role in this study | Refs |
| --- | --- | --- | --- | --- | --- |
| Locked learned temporal prior \(*this study*\) | Case onset dates and parent–child temporal gaps; no target\-specific genomic data required\. | Candidate\-parent ranking under a strict benchmark\. Assumes transferable timing structure can be learned externally and locked before target evaluation\. | Comparator\-fair timing\-only design; applicable when genomics are absent, delayed, sparse, or weakly informative; supports zero\-shot temporal\-prior transfer\. | Cannot independently prove direct transmission; performance depends on informative temporal gaps; evaluated as a ranking prior, not a full transmission\-tree model\. | This study; \[[24](https://arxiv.org/html/2606.30842#bib.bib1),[5](https://arxiv.org/html/2606.30842#bib.bib3)\] |
| Parametric serial\-interval priors \(Gaussian, Gamma, Lognormal\) | Symptom\-onset or infection\-time intervals estimated from source outbreaks or literature\. | Temporal likelihood weighting of candidate infectors\. Assumes the chosen distributional family fits the target outbreak\. | Simple, interpretable, and widely used; suitable as transparent temporal baselines in outbreak analysis\. | Sensitive to distributional misspecification and disease\-specific interval assumptions; not a learned transferable representation\. | \[[24](https://arxiv.org/html/2606.30842#bib.bib1),[23](https://arxiv.org/html/2606.30842#bib.bib2),[10](https://arxiv.org/html/2606.30842#bib.bib17),[18](https://arxiv.org/html/2606.30842#bib.bib32)\] |
| Nonparametric timing prior \(KDE\) | Observed source\-outbreak timing intervals\. | Flexible temporal\-density estimation without a fixed parametric family\. Assumes source intervals represent the target regime\. | More flexible than Gaussian, Gamma, or Lognormal priors; avoids explicit parametric\-shape assumptions\. | Bandwidth\-sensitive; weak extrapolation outside observed support; a density estimator, not a discriminative learned ranking prior\. | \[[24](https://arxiv.org/html/2606.30842#bib.bib1),[23](https://arxiv.org/html/2606.30842#bib.bib2)\] |
| Wallinga–Teunis source attribution | Case onset times and a known or estimated serial\-interval distribution\. | Probabilistic source attribution among candidate infectors\. Assumes reliable serial\-interval information and adequate case ascertainment\. | Foundational timing\-only source\-attribution framework; produces interpretable infector probabilities\. | Does not model genomic, graph\-level, or label uncertainty; requires a suitable serial\-interval reference\. | \[[24](https://arxiv.org/html/2606.30842#bib.bib1)\] |
| outbreaker2 | Sampling dates, contact information, and pathogen genetic sequences\. | Bayesian posterior inference over transmission trees\. Assumes suitable epidemiological and evolutionary model specifications\. | Joint epidemiological–sequence inference; modular likelihood design; widely used for outbreak reconstruction\. | Requires sequence data and careful model specification; computational cost scales with outbreak size\. | \[[2](https://arxiv.org/html/2606.30842#bib.bib16),[9](https://arxiv.org/html/2606.30842#bib.bib14)\] |
| SCOTTI | Multiple pathogen sequences per host plus host sampling times\. | Host\-to\-host transmission under a structured coalescent model\. Assumes hosts can be represented as structured subpopulations\. | Models within\-host evolution and possible unsampled intermediates; richer than consensus\-only approaches\. | High computational and modelling demands; depends on assumptions about within\-host diversity and evolutionary structure\. | \[[7](https://arxiv.org/html/2606.30842#bib.bib8)\] |
| Genome\-integrated inference in partially sampled outbreaks | Pathogen genomes, sampling dates, and epidemiological context\. | Transmission inference with incomplete sampling and possible unobserved hosts\. Assumes the phylogeny–transmission relationship is adequately modelled\. | Makes the unsampled\-host problem explicit; suited to ongoing outbreaks with rolling sequencing\. | Within\-host diversity, missing cases, and incomplete sampling remain major uncertainty sources\. | \[[8](https://arxiv.org/html/2606.30842#bib.bib15),[13](https://arxiv.org/html/2606.30842#bib.bib9)\] |
| Variant\-aware epi\-genomic integration \(BadTrIP; Carson et al\.\) | Within\-outbreak genomic variants, epidemiological links, and intrahost variant\-frequency data\. | Transmission reconstruction using variant frequencies and pairwise genomic evidence\. Assumes variant dynamics are informative and estimable\. | Exploits richer genomic information than consensus\-only methods; improves discrimination among candidate transmission pairs\. | Requires dense sequencing and strong assumptions about within\-host evolutionary dynamics; not available in timing\-only settings\. | \[[6](https://arxiv.org/html/2606.30842#bib.bib12),[3](https://arxiv.org/html/2606.30842#bib.bib31)\] |
| ScITree | Epidemiological data and pathogen genome sequences\. | Scalable Bayesian posterior inference over transmission trees from combined epidemiological and genomic likelihoods\. | Designed for larger outbreaks; accounts for epidemiological and genomic uncertainty more efficiently than full phylodynamic models\. | Requires sequence data; not applicable to the timing\-only parent\-ranking regime evaluated here\. | \[[22](https://arxiv.org/html/2606.30842#bib.bib33)\] |
| JUNIPER; BREATH | Dense per\-case sequencing with epidemiological and phylodynamic information\. | Full posterior transmission\-tree inference under joint phylodynamic–epidemiological models\. | Data\-rich outbreak reconstruction; principled posterior inference over transmission histories\. | Requires high\-quality genomic data and sufficient within\-host signal; cited as emerging preprint frameworks only, not as peer\-reviewed evidence or numerical comparators in this study\. | \[[19](https://arxiv.org/html/2606.30842#bib.bib27),[4](https://arxiv.org/html/2606.30842#bib.bib28)\] |
| Decision making under deep uncertainty | Model outputs assessed across multiple plausible future or structural scenarios\. | Prioritization under structural uncertainty\. Assumes relevant uncertainty scenarios can be enumerated and stress\-tested\. | Directly relevant to threshold instability, top\-$k$ priority changes, and policy\-facing robustness\. | Not a transmission reconstruction model; used here as conceptual grounding for decision\-instability analysis\. | \[[11](https://arxiv.org/html/2606.30842#bib.bib13),[21](https://arxiv.org/html/2606.30842#bib.bib5)\] |
| Fragility and reversal robustness diagnostics | Binary, paired, or directional outcomes from a primary analysis\. | Minimum outcome reversals required to change a statistical conclusion\. | Expresses robustness in task or trial counts; complements bootstrap and permutation tests\. | Classical fragility index targets dichotomous trial endpoints; adaptation is required for paired ranking designs\. | \[[25](https://arxiv.org/html/2606.30842#bib.bib6),[26](https://arxiv.org/html/2606.30842#bib.bib7)\] |

Note\. The present study occupies the first row\. Other methods are included to clarify differences in input requirements, inference targets, assumptions, strengths, and limitations relative to the timing\-only locked\-prior ranking regime evaluated here\. Methods requiring dense genomic, phylodynamic, or within\-host variant evidence are discussed as adjacent outbreak\-reconstruction frameworks rather than as direct numerical comparators\.
## 3 Methods
The study design is illustrated in Figure [1](https://arxiv.org/html/2606.30842#S3.F1)\. A temporal transmission prior is trained on a multi\-disease benchmark, locked before any target outbreak data are accessed, and evaluated on external outbreak benchmarks under strict transfer conditions\. Transmission\-label uncertainty is assessed independently through two additional evidence modules\. Algorithm [1](https://arxiv.org/html/2606.30842#algorithm1) formalizes the benchmark construction procedure applied consistently across all datasets\.
Figure 1: Study design and evidence architecture\. A source\-trained temporal prior is learned on the D1 multi\-disease benchmark, locked before target access, and evaluated on the strict ANDV parent\-ranking benchmark\. Separate MPXV and Guangdong Delta modules quantify transmission\-label reliability and graph\-level decision uncertainty; these modules support uncertainty analysis rather than prior validation\.

Algorithm 1: Strict candidate\-infector ranking benchmark construction\.

Input: Outbreak case set $\mathcal{V}$ with documented case times $d(c)$; directed transmission\-evidence edge set $\mathcal{E}$; edge\-confidence labels $q(e)$; feasible temporal window $[w_{\min}, w_{\max}]$\.

Output: Strict parent\-ranking benchmark $\mathcal{B}=\{(j,\mathcal{C}_{j},p_{j})\}$\.

1. Initialize $\mathcal{B}\leftarrow\emptyset$\.
2. For each case $j\in\mathcal{V}$:
   1. If $d(j)$ is missing, continue\.
   2. Let $\mathcal{P}^{\mathrm{doc}}_{j}=\{i:(i,j)\in\mathcal{E}\}$\.
   3. If $|\mathcal{P}^{\mathrm{doc}}_{j}|\neq 1$, continue \(unique documented parent\)\.
   4. Set $p_{j}$ to the single element of $\mathcal{P}^{\mathrm{doc}}_{j}$\.
   5. If $d(p_{j})$ is missing, continue\.
   6. If $q(p_{j},j)$ is not classified as strict or high\-confidence, continue \(edge\-confidence filter\)\.
   7. Compute $\Delta_{j}=d(j)-d(p_{j})$\.
   8. If $\Delta_{j}<w_{\min}$ or $\Delta_{j}>w_{\max}$, continue\.
   9. Initialize $\mathcal{C}_{j}\leftarrow\emptyset$\.
   10. For each case $i\in\mathcal{V}\setminus\{j\}$:
       1. If $d(i)$ is missing, continue\.
       2. Compute $\Delta_{ij}=d(j)-d(i)$\.
       3. If $w_{\min}\leq\Delta_{ij}\leq w_{\max}$, add $i$ to $\mathcal{C}_{j}$\.
   11. If $p_{j}\notin\mathcal{C}_{j}$, continue \(documented parent must be temporally eligible\)\.
   12. Add $(j,\mathcal{C}_{j},p_{j})$ to $\mathcal{B}$\.
3. Return $\mathcal{B}$\.

### 3\.1 Problem Formulation: Candidate\-Infector Ranking
Transmission reconstruction is framed as a candidate ranking problem\. For each target case $j$, a candidate set $\mathcal{C}(j)$ is defined as all cases whose documented onset dates fall within a feasible temporal window preceding $j$’s onset\. Exactly one candidate in $\mathcal{C}(j)$ is the true parent $p_{j}$, the case from which $j$ acquired the pathogen\. The task is to assign the highest rank to $p_{j}$ among all candidates\.
This formulation is operationally grounded: a contact tracer evaluating a newly confirmed case must score all temporally eligible predecessors and direct investigation toward the most probable source\. The ranking task is strictly more demanding than binary classification, because the model must correctly order within a set of temporally plausible candidates, not merely separate linked from unlinked pairs\.
Ranking performance is evaluated using mean reciprocal rank \(MRR\), defined as

$$\mathrm{MRR}=\frac{1}{|T|}\sum_{j\in T}\frac{1}{\mathrm{rank}(p_{j})},\tag{1}$$

where $T$ denotes the set of target cases and $\mathrm{rank}(p_{j})$ is the position of the true parent in the ranked candidate list for task $j$\. MRR is the primary evaluation metric because it penalizes false high\-confidence assignments more severely than Top\-$k$ accuracy and rewards consistent near\-top placement across all tasks\. Additional metrics reported include Top\-1, Top\-3, and Top\-5 accuracy, normalized discounted cumulative gain \(NDCG\), and mean true\-parent rank\.
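As a minimal sketch of Equation (1) and the Top-$k$ metrics, the functions below compute both from a list of true-parent ranks. The function names, the tie-handling rule in `rank_of_true_parent` (ties counted against the true parent), and the example ranks are assumptions made for illustration; the source does not state how ties are resolved.

```python
def rank_of_true_parent(scores: dict, true_parent) -> int:
    """1-based rank of the true parent within one task.

    Assumption: candidates with an equal score are counted against the true parent.
    """
    s = scores[true_parent]
    return 1 + sum(1 for c, v in scores.items() if c != true_parent and v >= s)

def mean_reciprocal_rank(ranks) -> float:
    """Equation (1): average of 1 / rank(p_j) over all tasks."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def top_k_accuracy(ranks, k: int) -> float:
    """Fraction of tasks whose true parent is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical true-parent ranks for four tasks (not study data).
ranks = [1, 2, 4, 1]
print(mean_reciprocal_rank(ranks))  # 0.6875
print(top_k_accuracy(ranks, 1))     # 0.5
```

For scale, on a 29-task benchmark a Top-1 accuracy of 37.9% corresponds to 11 tasks with the true parent ranked first (11/29 ≈ 0.3793), and 13.8% corresponds to 4 tasks.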
### 3\.2 The D1 Multi\-Disease Training Benchmark
#### 3\.2\.1 Source Data
The multi\-disease training benchmark, denoted D1, was constructed from a curated collection of published outbreak transmission trees spanning eleven disease groups\. Each tree records directed parent\-to\-child transmission links with associated symptom onset dates\. Diseases represented include measles, SARS\-CoV\-2, Ebola, influenza, norovirus, Legionella, tuberculosis, hepatitis A, Middle East respiratory syndrome, Nipah\[[18](https://arxiv.org/html/2606.30842#bib.bib32)\], and an orthopoxvirus group comprising smallpox and pre\-2022 clade I monkeypox outbreaks\.
#### 3\.2\.2 Benchmark Construction
For each child case $j$, a candidate parent set was constructed by pooling all cases whose onset dates fell within a 1–60\-day window before $j$’s onset\. This window spans the plausible serial\-interval range across the 11 D1 disease groups while excluding biologically implausible temporal orderings\. Cases outside this window were excluded\. After data cleaning, including removal of records with missing onset dates and elimination of self\-loops, the final D1 benchmark comprised:
- 14,919 candidate parent rows;
- 559 true parent rows;
- 559 unique ranked child\-reconstruction tasks;
- 11 disease groups\.
Each ranked task was verified to contain exactly one true parent among its candidates; tasks with zero or multiple strict positives were excluded\.
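The strict construction described by Algorithm 1, combined with the 1–60-day eligibility window, can be sketched as follows. The data layout (a dict of onset dates, a list of directed edges, a dict of confidence labels), the label strings `"strict"` and `"high"`, and all case identifiers are hypothetical choices for illustration, not the study's actual data format.

```python
from datetime import date

def build_strict_benchmark(onset, edges, confidence, w_min=1, w_max=60,
                           accepted=("strict", "high")):
    """Return a list of (child, candidates, true_parent) ranking tasks.

    onset: case id -> date or None; edges: list of (parent, child);
    confidence: (parent, child) -> label. All layouts are assumed.
    """
    documented = {}
    for parent, child in edges:
        documented.setdefault(child, set()).add(parent)

    benchmark = []
    for j, d_j in onset.items():
        if d_j is None:
            continue
        parents = documented.get(j, set())
        if len(parents) != 1:                      # unique documented parent
            continue
        (p_j,) = parents
        if onset.get(p_j) is None:
            continue
        if confidence.get((p_j, j)) not in accepted:   # edge-confidence filter
            continue
        gap = (d_j - onset[p_j]).days
        if gap < w_min or gap > w_max:
            continue
        candidates = [i for i, d_i in onset.items()
                      if i != j and d_i is not None
                      and w_min <= (d_j - d_i).days <= w_max]
        if p_j not in candidates:                  # parent must be eligible
            continue
        benchmark.append((j, candidates, p_j))
    return benchmark

# Hypothetical toy outbreak.
onset = {"A": date(2020, 1, 1), "B": date(2020, 1, 10), "C": date(2020, 1, 15),
         "D": None, "E": date(2020, 6, 1), "F": date(2020, 1, 20)}
edges = [("A", "B"), ("B", "C"), ("A", "D"), ("A", "E"), ("C", "F")]
confidence = {("A", "B"): "strict", ("B", "C"): "probable", ("A", "D"): "strict",
              ("A", "E"): "strict", ("C", "F"): "high"}
print(build_strict_benchmark(onset, edges, confidence))
```

In this toy input, case C is dropped by the confidence filter, D by its missing onset, and E because its documented parent falls outside the window, leaving tasks for B and F only.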
#### 3\.2\.3 Leave\-One\-Disease\-Out Evaluation Design
Generalization across disease families was assessed using a leave\-one\-disease\-out \(LODO\) cross\-validation design\. In each of the 11 folds, all cases from one disease group are held out entirely; the temporal prior is trained on the remaining ten groups and evaluated on the held\-out group\. This design is more stringent than random partitioning: the model is evaluated on a disease family absent from its training set\.
Disease\-level MRR and Top\-1 estimates were computed for each fold\. Disease\-macro statistics were derived by averaging across folds, and bootstrap confidence intervals were obtained by resampling folds with replacement over 10,000 iterations\.
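A minimal sketch of the fold construction and the fold-level bootstrap is given below. It assumes a percentile bootstrap interval, which the text does not specify, and uses hypothetical task identifiers and per-fold scores; only the leave-one-disease-out split and the resampling of folds with replacement come from the description above.

```python
import random

def lodo_folds(disease_of_task: dict):
    """Yield (held_out_disease, train_tasks, test_tasks) for each disease group."""
    for held_out in sorted(set(disease_of_task.values())):
        train = [t for t, d in disease_of_task.items() if d != held_out]
        test = [t for t, d in disease_of_task.items() if d == held_out]
        yield held_out, train, test

def bootstrap_macro_ci(fold_scores, n_boot=10_000, level=0.95, seed=0):
    """Disease-macro mean with a percentile bootstrap CI over folds (assumed method)."""
    rng = random.Random(seed)
    n = len(fold_scores)
    means = sorted(sum(rng.choices(fold_scores, k=n)) / n for _ in range(n_boot))
    lo = means[round((1 - level) / 2 * n_boot)]
    hi = means[round((1 + level) / 2 * n_boot) - 1]
    return sum(fold_scores) / n, lo, hi

# Hypothetical tasks and per-fold MRR values (not study data).
tasks = {"t1": "measles", "t2": "measles", "t3": "ebola", "t4": "nipah"}
for held_out, train, test in lodo_folds(tasks):
    print(held_out, train, test)
print(bootstrap_macro_ci([0.40, 0.50, 0.60, 0.70, 0.55], n_boot=2000))
```

Resampling folds rather than individual tasks treats each disease family as the unit of generalization, so the interval reflects between-disease variability rather than within-disease task counts.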
### 3\.3 The Learned Temporal Prior
#### 3\.3\.1 Feature Representation
For each candidate parent–child pair $(i,j)$, the primary input is the signed serial gap

$$\Delta t=\mathrm{onset}(j)-\mathrm{onset}(i),\tag{2}$$

measured in integer days\. The final timing\-only feature set was derived exclusively from this gap and comprised the signed gap $(\Delta t)$, the absolute gap $(|\Delta t|)$, a binary indicator for whether the candidate source onset did not occur after the target onset $(\Delta t\geq 0)$, binary polarity indicators for negative $(\Delta t<0)$, zero $(\Delta t=0)$, and positive $(\Delta t>0)$ gaps, the squared signed gap $(\Delta t)^{2}$, and the squared absolute gap $(|\Delta t|)^{2}$\. Candidate\-parent eligibility was determined independently using a predefined temporal window of 1–60 days before child\-case onset, separating eligibility constraints from feature construction\. No demographic, clinical, spatial, contact\-network, or genomic attributes were incorporated; the model operates solely on temporal\-gap structure\.
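The eight gap-derived features listed above can be written out directly. The function name and the ordering of the features are assumptions; the feature definitions themselves follow the text.

```python
def gap_features(delta_t: int) -> list:
    """Timing-only features derived from the signed serial gap (in integer days)."""
    abs_gap = abs(delta_t)
    return [
        float(delta_t),        # signed gap
        float(abs_gap),        # absolute gap
        float(delta_t >= 0),   # source onset not after target onset
        float(delta_t < 0),    # negative-gap indicator
        float(delta_t == 0),   # zero-gap indicator
        float(delta_t > 0),    # positive-gap indicator
        float(delta_t ** 2),   # squared signed gap
        float(abs_gap ** 2),   # squared absolute gap
    ]

print(gap_features(5))   # [5.0, 5.0, 1.0, 0.0, 0.0, 1.0, 25.0, 25.0]
print(gap_features(-3))  # [-3.0, 3.0, 0.0, 1.0, 0.0, 0.0, 9.0, 9.0]
```

Note that the two squared terms are numerically identical for every gap, and the $\Delta t\geq 0$ indicator equals the sum of the zero and positive indicators, so the feature set as listed contains exact linear dependencies.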
#### 3\.3\.2 Model Architecture
A logistic regression model is trained on the binary label $y\in\{0,1\}$, where $y=1$ indicates that case $i$ is the documented parent of case $j$\. The model output $P(\Delta t)=\sigma(\hat{\theta}^{\top}\phi(i,j))$ represents the learned plausibility that a candidate with gap $\Delta t$ is the true parent, conditional on membership in the candidate set\. Within each task, this score functions as a discriminative ranking weight, not a calibrated marginal probability\.
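To make the scoring step concrete, the sketch below applies $\sigma(\hat{\theta}^{\top}\phi)$ to each candidate in one task and sorts by score. The weights, the bias, and the reduced two-feature map are invented for illustration and are not the fitted parameters or the full feature set of the study.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical parameters for a reduced feature map [gap, gap^2]; NOT the study's.
THETA = (0.4, -0.01)
BIAS = -4.0

def phi(delta_t: int):
    return (float(delta_t), float(delta_t ** 2))

def plausibility(delta_t: int) -> float:
    """Score sigma(theta^T phi + bias) for a candidate with the given gap."""
    z = BIAS + sum(w * x for w, x in zip(THETA, phi(delta_t)))
    return sigmoid(z)

def rank_candidates(gaps: dict) -> list:
    """gaps: candidate id -> gap in days; returns ids from most to least plausible."""
    return sorted(gaps, key=lambda c: plausibility(gaps[c]), reverse=True)

print(rank_candidates({"A": 5, "B": 18, "C": 40}))  # ['B', 'A', 'C']
```

Because only the within-task ordering matters, any monotone transform of the score yields the same ranking, which is consistent with treating the output as a ranking weight rather than a calibrated probability.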
Logistic regression was selected over more complex architectures following systematic comparison on D1\. A listwise multilayer perceptron \(MLP\) achieved disease\-macro MRR of 0\.573 and a listwise linear ranker achieved 0\.564, both below the logistic regression value of 0\.575\. Because the three values are nearly tied, additional model capacity did not improve ranking; this is consistent with performance deriving from the learned temporal structure rather than from model capacity, and it favors a compact model for robustness across small\-sample disease folds\.
#### 3\.3\.3Prior Locking and External Validation Protocol
After training on D1, the learned prior is locked: no parameters are modified at any subsequent stage\. The locked model is saved as a serialized object, and its plausibility curve is stored as a precomputed lookup table over integer gap values from $-30$ to $+90$ days\. Algorithm [2](https://arxiv.org/html/2606.30842#algorithm2) formalizes this two\-phase evaluation protocol\.
All external evaluations load this saved artifact without refitting\. Reproducibility was confirmed by applying the serialized prior to the ANDV benchmark, recovering Top\-1 = 0\.3793 and MRR = 0\.5709 to four decimal places, consistent with training\-time estimates\.
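The locking step can be pictured as freezing the curve into a read-only table, as in the sketch below. The coefficients and the three-term feature map are hypothetical stand-ins; the study's serialized artifact and its format are not described at this level of detail.

```python
import math

# Hypothetical coefficients for intercept, gap, and squared gap; NOT the fitted prior.
THETA = (-2.0, 0.2, -0.005)

def plausibility(dt, theta=THETA):
    """Logistic score for an integer gap dt under a reduced three-term feature map."""
    z = theta[0] + theta[1] * dt + theta[2] * dt * dt
    return 1.0 / (1.0 + math.exp(-z))

# Freeze the curve once over the integer gaps -30..+90 named in the text.
RAW = {dt: plausibility(dt) for dt in range(-30, 91)}
PEAK = max(RAW.values())
LOOKUP = {dt: p / PEAK for dt, p in RAW.items()}  # peak-normalized

def locked_score(dt):
    """Read-only scoring path used at evaluation time; no refitting occurs here."""
    return LOOKUP[dt]
```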
#### 3\.3\.4Temporal Prior Shape Analysis
To characterize what the model captured, the plausibility curve was analyzed across gap values and compared across LODO folds\. The globally trained prior exhibits peak plausibility at $\Delta t=20.5$ days, a peak\-normalized score of 1\.0, an 80% support window spanning 3\.25 to 37\.75 days, a 50% support window spanning 0\.25 to 46\.50 days, and a positive\-to\-negative gap area ratio of 11\.48\. These support windows describe properties of the learned plausibility curve and should not be confused with the 1–60 day candidate eligibility window used for benchmark construction \(Section [3\.2\.2](https://arxiv.org/html/2606.30842#S3.SS2.SSS2)\); they reflect where the prior assigns substantial plausibility, not which candidates are included in a ranking task\.
Fold\-to\-fold stability was assessed by computing Pearson and Spearman correlations across all 55 pairwise comparisons of the 11 LODO\-trained curves:
- mean Pearson correlation: 0\.9917; median: 0\.9950;
- mean Spearman correlation: 0\.9694; median: 0\.9838\.
This near\-invariance of curve shape across held\-out folds is not a performance measure; it is consistent with the model having captured a stable structural regularity in transmission timing rather than disease\-specific interval patterns\.
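The 55 pairwise comparisons follow from choosing 2 of 11 curves. A standard-library sketch of the computation is given below; the synthetic curves are hypothetical placeholders for the fold-trained plausibility curves.

```python
from itertools import combinations
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def pairwise_stability(curves):
    pairs = list(combinations(curves, 2))
    p = [pearson(a, b) for a, b in pairs]
    s = [spearman(a, b) for a, b in pairs]
    return len(pairs), sum(p) / len(p), sum(s) / len(s)

# Hypothetical curves: 11 bell shapes with slightly shifted peaks over gaps -30..+90.
grid = range(-30, 91)
curves = [[math.exp(-((dt - (18 + 0.5 * f)) / 15) ** 2) for dt in grid] for f in range(11)]
n_pairs, mean_pearson, mean_spearman = pairwise_stability(curves)
```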
Figure 2: Locked temporal prior and source\-trained temporal baselines\. Curves show peak\-normalized plausibility as a function of parent–child temporal gap for the globally locked learned temporal prior and the Gaussian, KDE, Gamma, and Lognormal source\-trained temporal baselines\. The LODO fold\-stability statistics reported in Section [4\.2](https://arxiv.org/html/2606.30842#S4.SS2) were computed separately from the 11 fold\-trained curves\.
### 3\.4Baseline Comparators
The locked learned prior is compared against four source\-trained temporal\-likelihood baselines: three parametric distributions \(Gaussian, Gamma, and lognormal\) and one kernel density estimate\. Each baseline assigns a score to candidate pair $(i,j)$ based solely on the gap $\Delta t$ and a distributional fit to source training data\. All baselines are fit using D1 source data only, without access to any target outbreak timing, ensuring comparator fairness\.
##### Gaussian serial\-gap likelihood\.
A Gaussian distribution is fit by maximum likelihood to all observed source gaps\. Each candidate pair receives the Gaussian probability density evaluated at $\Delta t$\.
##### Kernel density estimate likelihood\.
A Gaussian kernel density estimator with bandwidth selected by Scott’s rule is fit to source gaps\. The score is the KDE density at $\Delta t$\.
##### Gamma positive\-gap likelihood\.
A Gamma distribution is fit by maximum likelihood to positive source gaps \($\Delta t>0$\)\. Candidates with $\Delta t\leq 0$ receive a score of zero\.
##### Lognormal positive\-gap likelihood\.
A lognormal distribution is fit by maximum likelihood to positive source gaps\. Candidates with $\Delta t\leq 0$ receive a score of zero\.
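Two of these baselines have closed-form maximum-likelihood fits and can be sketched with the standard library alone. The Gamma and KDE baselines are omitted here because they need iterative fitting or a bandwidth rule; the function names and source gaps are hypothetical.

```python
import math

def fit_gaussian(gaps):
    """Gaussian MLE: sample mean and the 1/n standard deviation."""
    n = len(gaps)
    mu = sum(gaps) / n
    sigma = math.sqrt(sum((g - mu) ** 2 for g in gaps) / n)
    return mu, sigma

def gaussian_score(dt, mu, sigma):
    return math.exp(-0.5 * ((dt - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_lognormal(gaps):
    """Lognormal MLE: Gaussian MLE on the logs of the positive gaps only."""
    return fit_gaussian([math.log(g) for g in gaps if g > 0])

def lognormal_score(dt, mu, sigma):
    if dt <= 0:
        return 0.0  # non-positive gaps receive a score of zero
    z = (math.log(dt) - mu) / sigma
    return math.exp(-0.5 * z * z) / (dt * sigma * math.sqrt(2 * math.pi))

# Hypothetical source gaps in days (illustrative only).
source_gaps = [-2, 5, 10, 14, 18, 21, 25, 30, 40]
g_mu, g_sigma = fit_gaussian(source_gaps)
l_mu, l_sigma = fit_lognormal(source_gaps)
```

Unlike the positive-gap baselines, the Gaussian fit uses all source gaps and therefore assigns nonzero density to non-positive gaps.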
In addition to the four fair baselines, results are reported against an Epuyén outbreak\-specific serial\-interval reference: a Gaussian fit to the Epuyén 2018–2019 ANDV serial intervals, with mean approximately 23 days and standard deviation approximately 7 days\. This comparator is outbreak\-contextual and is not a fair source\-trained baseline, because it was calibrated using target\-disease timing information\. It is included solely to contextualize the magnitude of the learned\-prior advantage relative to an informed disease\-specific reference\.
### 3\.5Real ANDV Benchmark Construction
#### 3\.5\.1Source Outbreaks
The primary external validation benchmark was constructed from two published person\-to\-person ANDV outbreak investigations\. Andes virus was not represented in D1 at any stage of training; the ANDV benchmark therefore constitutes a genuinely external zero\-shot evaluation on a pathogen family absent from the training distribution\.
##### Epuyén 2018–2019 outbreak\[[17](https://arxiv.org/html/2606.30842#bib.bib18)\]\.
This outbreak involved 34 confirmed cases across four generations of documented person\-to\-person transmission in a rural Patagonian community in Argentina\. Case\-level symptom onset dates, contact exposures, and inferred transmission links, including high\-confidence parent\-to\-child pairs and lower\-confidence alternative links, were extracted from the published report and its supplementary appendix\. High\-confidence edges from the main text and Supplementary Table S2 were used as primary true\-parent assignments\.
##### 2014 Argentina cluster\.
A three\-case sequential transmission chain was documented in published surveillance records, with onset dates and directional transmission confirmed by full\-length sequencing\.
Combined, the two sources contributed 37 cases and, after applying the eligibility criteria described below, 29 strict documented parent\-to\-child edges\.
#### 3\.5\.2Strict Edge Inclusion Criterion
A parent\-to\-child pair $(i,j)$ is included in the strict benchmark only when all four of the following conditions are satisfied:
1. Both onset dates are documented in the source record and are not imputed\.
2. The serial gap $\Delta t=\mathrm{onset}(j)-\mathrm{onset}(i)$ satisfies $1\leq\Delta t\leq 60$ days\.
3. The edge is classified as high\-confidence or definitively confirmed in the source document; edges marked as uncertain, possible, or alternative are excluded\.
4. Case $j$ has no other high\-confidence documented parent, satisfying the strict unique\-parent requirement\.
Uncertain and alternative links excluded under criterion 3 are retained separately for the uncertainty\-expansion analysis described in Section [3\.11](https://arxiv.org/html/2606.30842#S3.SS11)\. This separation ensures that strict evidence drives the primary method\-performance claims and that uncertain evidence contributes only to the secondary structural\-uncertainty analysis\.
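The four inclusion criteria can be expressed as a single filter, sketched below under stated assumptions: the `Edge` record, the confidence vocabulary, and the case identifiers are hypothetical, and the source's actual data schema is not specified here.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Edge:
    parent: str
    child: str
    confidence: str  # hypothetical labels: "high", "confirmed", "uncertain", "possible", ...

def strict_edges(edges, onsets, imputed=frozenset(), w_min=1, w_max=60):
    """Keep edges that satisfy all four strict-benchmark criteria."""
    strong = [e for e in edges if e.confidence in {"high", "confirmed"}]  # criterion 3
    strong_parents = Counter(e.child for e in strong)
    kept = []
    for e in strong:
        oi, oj = onsets.get(e.parent), onsets.get(e.child)
        if oi is None or oj is None or e.parent in imputed or e.child in imputed:
            continue  # criterion 1: both onsets documented and not imputed
        if not (w_min <= (oj - oi).days <= w_max):
            continue  # criterion 2: gap within 1-60 days
        if strong_parents[e.child] != 1:
            continue  # criterion 4: unique high-confidence parent
        kept.append(e)
    return kept

# Hypothetical cases and links, for illustration only.
onsets = {"A": date(2018, 11, 3), "B": date(2018, 11, 24),
          "C": date(2018, 12, 10), "D": date(2019, 3, 1)}
edges = [Edge("A", "B", "high"), Edge("A", "C", "high"), Edge("B", "C", "high"),
         Edge("B", "D", "high"), Edge("A", "D", "possible")]
kept = strict_edges(edges, onsets)
```

In this toy input only A→B survives: C has two high-confidence parents, B→D exceeds the 60-day window, and A→D is not high-confidence.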
After applying all criteria, the final strict ANDV benchmark contained:
- 37 total cases \(34 Epuyén, 3 from the 2014 cluster\);
- 29 strict true parent\-to\-child edges;
- 29 unique ranked child\-reconstruction tasks;
- 395 candidate parent rows, with a mean of 13\.6 candidates per task \(range 1 to 28\); 27 tasks contained at least two candidate parents\.
Each task was verified to contain exactly one positive true parent\.
#### 3\.5\.3Candidate Set Construction
For each child case $j$, the candidate\-parent set $\mathcal{C}_{j}$ comprises all cases with documented onset dates falling within the interval

$$[\,\mathrm{onset}(j)-60,\;\mathrm{onset}(j)-1\,], \tag{3}$$

that is, 1 to 60 days before the child\-case onset\. This window mirrors the D1 construction window and spans the full documented range of ANDV serial intervals observed in the Epuyén outbreak \(4–45 days\)\. All temporally eligible cases within the window are included as candidate parents; cases other than the strict documented parent are treated as negative candidates for ranking\.
Two strict tasks contained only one temporally eligible predecessor\. These singleton candidate sets were retained in the primary 29\-task strict benchmark because the documented parent was uniquely determined under the eligibility rule\. Tasks with singleton candidate sets contribute deterministic $\mathrm{MRR}=1.0$ for all methods and therefore do not contribute to discrimination between methods\. They are retained in the primary benchmark for completeness, and all key comparisons are replicated on the 27 nontrivial tasks containing at least two candidate parents in the candidate\-window sensitivity analysis of Section [4\.9](https://arxiv.org/html/2606.30842#S4.SS9)\.
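Candidate-set construction under the 1–60 day window can be sketched as follows. The case identifiers and onset dates are hypothetical; the example also shows how a singleton candidate set arises.

```python
from datetime import date, timedelta

def candidate_parents(child, onsets, w_min=1, w_max=60):
    """All other cases whose onset falls in [onset(child) - w_max, onset(child) - w_min]."""
    oj = onsets[child]
    earliest = oj - timedelta(days=w_max)
    latest = oj - timedelta(days=w_min)
    return sorted(c for c, o in onsets.items() if c != child and earliest <= o <= latest)

# Hypothetical onset dates (illustrative only).
onsets = {"A": date(2018, 11, 3), "B": date(2018, 11, 24), "C": date(2018, 12, 10)}
for case in onsets:
    print(case, candidate_parents(case, onsets))
```

Case B has a single eligible predecessor, so every method ranks its documented parent first and the task cannot separate methods.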
### 3\.6Global\.Health Public Outbreak Repository Audit
To assess the suitability of large public outbreak repositories for strict parent\-ranking benchmark construction, a systematic audit was conducted on the Global\.Health line\-list database\. Records were downloaded and screened for the fields required by Algorithm [1](https://arxiv.org/html/2606.30842#algorithm1): a documented onset date for each case, a directed transmission edge to a named parent case, a confidence label for that edge, and a unique\-parent structure\. The total corpus examined comprised 134,095 records\. The strict benchmark\-construction pipeline \(Algorithm [1](https://arxiv.org/html/2606.30842#algorithm1), with $w_{\min}=1$ day and $w_{\max}=60$ days\) was applied to all records satisfying the schema requirements\.
Only 26 transmission edges meeting the strict unique\-parent and high\-confidence criteria were recovered across the entire corpus, and zero ranked child\-reconstruction tasks could be constructed under the verified onset\-date requirement\. Large public line\-list repositories are therefore structurally unsuitable as sources of strict parent\-ranking benchmarks under the criteria applied here, confirming the need for targeted, manually curated outbreak records \(Sections [3\.5](https://arxiv.org/html/2606.30842#S3.SS5) and [3\.7](https://arxiv.org/html/2606.30842#S3.SS7)\) as the basis for external validation\.
### 3\.7Sudan Virus Disease Pilot Benchmark
A pilot parent\-ranking evaluation was conducted on a reconstructed Sudan virus disease \(SVD\) transmission network from a documented Uganda outbreak\. This evaluation is explicitly exploratory: the SVD dataset does not meet the sample\-size requirements for the primary robustness analyses applied to the ANDV benchmark, absolute onset dates were not uniformly recoverable \(Section [3\.7\.2](https://arxiv.org/html/2606.30842#S3.SS7.SSS2)\), and the locked learned prior was not applied using its serialized artifact under the zero\-shot transfer protocol used for the ANDV benchmark\. Performance estimates reported in Section [4\.8](https://arxiv.org/html/2606.30842#S4.SS8) should therefore be interpreted as relative\-time ranking feasibility evidence, not as confirmatory validation of the locked temporal prior\.
#### 3\.7\.1Network Reconstruction and Data Extraction
The transmission network was reconstructed from published surveillance records identifying case\-to\-case links with associated metadata from the 2022 Uganda Sudan virus disease outbreak\[[15](https://arxiv.org/html/2606.30842#bib.bib29),[12](https://arxiv.org/html/2606.30842#bib.bib30)\]\. Directed edges representing documented or highly probable transmission pairs were extracted, retaining only those for which relative temporal information was recoverable from the published records\.
#### 3\.7\.2Pilot Benchmark Construction
Because absolute onset dates were not uniformly available across all SVD cases, a relative\-time benchmark was constructed using the temporal ordering implied by the documented transmission sequence\. Candidate parent sets were defined using the same timing window applied in Algorithm [1](https://arxiv.org/html/2606.30842#algorithm1), parameterized with $w_{\min}=1$ day and $w_{\max}=60$ days\. Where absolute dates were absent, relative case\-ordering was used to assign approximate temporal positions consistent with the documented transmission sequence\.
The 1–60 day eligibility window was applied to approximate gap values derived from relative case positions\. Because absolute onset dates were unavailable, the ranking task compared the temporal plausibility of candidate parents defined by their relative position in the documented chain rather than by verified calendar dates\. Ranking behavior under this construction is characterized by two heuristics: a proximity\-to\-reference heuristic targeting a 14\-day gap, selected to match the approximate early\-phase SVD generation interval, and a shortest\-gap heuristic as a comparison condition\. These are temporal ranking heuristics applied to the relative\-time SVD benchmark; they are not equivalent to the locked learned prior applied under the zero\-shot protocol of Section [3\.3\.3](https://arxiv.org/html/2606.30842#S3.SS3.SSS3)\. Bootstrap resampling followed the protocol described in Section [3\.8](https://arxiv.org/html/2606.30842#S3.SS8); leave\-one\-task\-out diagnostics are reported for the best\-performing heuristic\.
### 3\.8Statistical Evaluation and Finite\-Sample Robustness
#### 3\.8\.1Primary Statistical Tests
For each comparison between the locked learned prior and a fair baseline on the ANDV benchmark, the following test statistics are computed\.
##### Paired MRR difference\.
The per\-task MRR difference for task $k$ is

$$\Delta\mathrm{MRR}(k)=\mathrm{MRR}_{\mathrm{learned}}(k)-\mathrm{MRR}_{\mathrm{baseline}}(k). \tag{4}$$

The mean $\Delta\mathrm{MRR}$ across all 29 tasks is the primary effect\-size estimate\.
##### Bootstrap confidence intervals\.
Tasks are resampled with replacement over 10,000 iterations\. The 95% confidence interval is the 2\.5th\-to\-97\.5th percentile of the bootstrap distribution of mean $\Delta\mathrm{MRR}$\.
##### Permutation test\.
Under the null hypothesis of equivalence, the sign of each per\-task $\Delta\mathrm{MRR}$ is flipped independently with probability 0\.5\. The mean of the permuted differences is computed over 100,000 repetitions\. The one\-sided $p$ value is the fraction of permuted means equalling or exceeding the observed mean $\Delta\mathrm{MRR}$\.
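A minimal sketch of this sign-flip permutation test is given below. The function name, the small floating-point tolerance, and the example differences are assumptions for illustration.

```python
import random

def sign_flip_pvalue(deltas, n_perm=100_000, seed=0):
    """One-sided p value: fraction of sign-flipped means >= the observed mean."""
    rng = random.Random(seed)
    n = len(deltas)
    observed = sum(deltas) / n
    hits = 0
    for _ in range(n_perm):
        permuted = sum(d if rng.random() < 0.5 else -d for d in deltas) / n
        if permuted >= observed - 1e-12:  # tolerance guards against rounding noise
            hits += 1
    return hits / n_perm

# Hypothetical per-task differences (not the ANDV values).
deltas = [0.5, 0.25, 0.0, 0.5, -0.1, 0.75, 0.3, 0.0, 0.5, 0.2]
p_value = sign_flip_pvalue(deltas)
```

With 100,000 repetitions the smallest nonzero p value this estimator can return is 0.00001, so very small reported values are bounded by the number of permutations.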
##### Wilcoxon signed\-rank test\.
A one\-sided Wilcoxon signed\-rank test is applied to the 29 per\-task $\Delta\mathrm{MRR}$ values to assess whether the task\-level advantage is systematically positive\.
##### Exact discordance test for Top\-1\.
Discordant task counts \(learned correct, baseline incorrect and vice versa\) are submitted to a one\-sided exact binomial sign test\.
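The exact test on discordant counts reduces to a binomial tail under a fair-coin null. A sketch follows; the counts in the example are hypothetical and are not the ANDV discordance counts.

```python
from math import comb

def sign_test_pvalue(n_learned_only, n_baseline_only):
    """One-sided exact p value: P(X >= n_learned_only) for X ~ Binomial(n, 0.5)."""
    n = n_learned_only + n_baseline_only
    return sum(comb(n, k) for k in range(n_learned_only, n + 1)) / 2 ** n

# Hypothetical: 8 tasks where only the learned prior is Top-1 correct, 1 the reverse.
p_top1 = sign_test_pvalue(8, 1)
print(p_top1)
```

Tasks on which both methods are correct, or both incorrect, do not enter the test.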
$P$ values for the primary four fair\-baseline comparisons are reported without multiple\-comparison correction, consistent with the pre\-specified analysis plan\. Confirmatory reporting applies Benjamini\-Hochberg correction to the family of four comparisons\.
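For reference, Benjamini-Hochberg adjusted p values for a small family can be computed as in this sketch. The four input p values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p values for four baseline comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```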
#### 3\.8\.2Finite\-Sample Robustness Analysis
Three supplementary analyses assess whether the ANDV conclusions depend on any small subset of tasks\. Results are shown in Figure [4](https://arxiv.org/html/2606.30842#S3.F4)\.
##### Leave\-one\-task\-out influence diagnostics\.
Mean $\Delta\mathrm{MRR}$ is recomputed after removing each of the 29 tasks in turn\. The maximum absolute shift across all 29 removals is reported as an influence measure\.
##### Jackknife standard error\.
The jackknife standard error of mean $\Delta\mathrm{MRR}$ is computed from the 29 leave\-one\-out estimates, providing a variance estimate that does not rely on the bootstrap exchangeability assumption\.
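Both the leave-one-task-out influence measure and the jackknife standard error derive from the same set of leave-one-out means, as the sketch below shows. Function names and example differences are hypothetical.

```python
import math

def leave_one_out_means(deltas):
    total, n = sum(deltas), len(deltas)
    return [(total - d) / (n - 1) for d in deltas]

def max_influence(deltas):
    """Largest absolute shift in the mean when a single task is removed."""
    full = sum(deltas) / len(deltas)
    return max(abs(m - full) for m in leave_one_out_means(deltas))

def jackknife_se(deltas):
    """Jackknife standard error of the mean from leave-one-out estimates."""
    n = len(deltas)
    loo = leave_one_out_means(deltas)
    loo_bar = sum(loo) / n
    return math.sqrt((n - 1) / n * sum((m - loo_bar) ** 2 for m in loo))

# Hypothetical per-task differences (illustrative only).
deltas = [0.5, 0.0, 1.0, -0.5, 0.5]
se = jackknife_se(deltas)
shift = max_influence(deltas)
```

For a simple mean, the jackknife standard error coincides with the usual sample standard deviation divided by $\sqrt{n}$, which gives a convenient check on an implementation.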
##### Paired task\-reversal robustness index\.
The minimum number of currently learned\-better task outcomes that would need to reverse to baseline\-better outcomes for the one\-sided sign\-test to become non\-significant at $\alpha=0.05$ is computed separately for MRR and for Top\-1\. MRR and true\-parent rank are treated as the robustness\-backed primary metrics; Top\-1 is retained as a descriptive endpoint\.
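One way to compute such a reversal index is sketched below, assuming ties are excluded and that "non-significant" means a one-sided exact sign-test p value of at least $\alpha$. The win and loss counts are hypothetical.

```python
from math import comb

def sign_test_p(wins, losses):
    """One-sided exact sign-test p value under a fair-coin null."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

def reversal_index(wins, losses, alpha=0.05):
    """Fewest win-to-loss reversals needed for the sign test to reach p >= alpha."""
    for r in range(wins + 1):
        if sign_test_p(wins - r, losses + r) >= alpha:
            return r
    return wins

# Hypothetical: 20 learned-better tasks and 5 baseline-better tasks.
print(reversal_index(20, 5))
```

A larger index means more individual task outcomes would have to change before the conclusion is lost.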
Figure 3: ANDV parent\-ranking performance \($n=29$ tasks\)\.

Figure 4: Finite\-sample robustness diagnostics\.
#### 3\.8\.3Source\-Domain Influence and Negative\-Control Diagnostics
To assess whether the external ANDV result depended disproportionately on any single D1 source disease group, we performed additional source\-domain influence diagnostics using the recovered source\-only training protocol that reproduced the archived locked\-prior ANDV metrics to the reported precision\. In the first analysis, each D1 disease group was removed one at a time, the timing\-only logistic prior was retrained on the remaining source data, and the resulting model was evaluated on the unchanged 29\-task strict ANDV benchmark\. This leave\-one\-source\-domain\-out procedure quantified the influence of individual source disease groups on external ANDV ranking performance\.
Second, targeted combined\-removal analyses were performed for the most influential source group and the orthopox\-coded group\. The orthopox\-coded group was removed as a combined smallpox/orthopox category because the D1 metadata did not encode separable mpox\-only rows\. This analysis was included to test whether the external ANDV result could be attributed to an orthopox\-related source\-domain analogue rather than to a broader learned temporal structure\.
Third, as an exploratory negative control, D1 parent labels were randomly permuted 2,000 times while preserving the same temporal feature matrix, candidate\-set geometry, and training protocol\. Each permuted\-label model was then evaluated on the unchanged strict ANDV benchmark\. This control tested whether the observed ANDV ranking performance could be explained by candidate\-set geometry alone under randomized source labels\.
These diagnostics were not used to tune the locked prior, select source domains, or alter the primary reported model\. They were conducted only to characterize source\-domain influence and to assess whether the learned\-prior advantage persisted beyond source\-label randomization and individual source\-domain dependence\.
### 3\.9MPXV Label\-Uncertainty Module
#### 3\.9\.1Source Data
Transmission\-label reliability was assessed using published data from a genomic epidemiology study of the 2022 New York City mpox outbreak\[[1](https://arxiv.org/html/2606.30842#bib.bib11)\], which sequenced 1,138 MPXV genomes and applied phylogenetic analysis to 43 epidemiologically linked cases organized into 17 linked groups\. The study assigned each inter\-host linked pair a phylogenetic concordance category indicating the degree to which genomic evidence was consistent with direct transmission\. From the supplementary files, 94 epidemiologically linked pair rows were extracted\. After restricting to inter\-host pairs and excluding within\-host comparisons, the analysis dataset comprised 75 pair rows\.
#### 3\.9\.2Codebook and Classification
A four\-category deterministic classification scheme was pre\-specified and applied to each inter\-host pair based on the phylogenetic concordance category reported in the source:
Strict supported \(monophyletic\): sequences from both cases form a well\-supported exclusive monophyletic clade; direct transmission is strongly consistent with the genomic evidence\.
Potential supported \(shared ancestor\): sequences share a recent common ancestor but do not form a fully exclusive clade; indirect or mediated transmission is consistent\.
Unresolved \(inconclusive\): phylogenetic placement provides insufficient resolution to support or refute a direct transmission link\.
Not supported \(not\-linked\): sequences are phylogenetically distant in a manner inconsistent with a direct transmission event\.
The mapping from source concordance categories to these four classes was applied deterministically to all 75 rows, and the resulting category counts were verified to reproduce the aggregate values reported in the original study\.
#### 3\.9\.3Deterministic Codebook Reproducibility
The MPXV linked\-pair audit used a deterministic four\-category codebook\. Each phylogenetic concordance category from the source study was mapped to exactly one evidence class: monophyletic pairs to strict genomic support, shared\-ancestor pairs to potential genomic support, inconclusive pairs to unresolved evidence, and not\-linked pairs to not genomically supported evidence\. This mapping was applied systematically to all 75 inter\-host pair rows\.
Because the audit classes are fully determined by the published concordance labels, both the pair\-level classification and the aggregate category counts reported here can be reproduced directly from the deterministic codebook and the source labels; pair\-level classification files are retained for journal submission and peer\-review reproducibility\.
#### 3\.9\.4Statistical Analysis
The primary quantity of interest is the combined proportion of unresolved and not\-supported pairs\. Exact binomial confidence intervals were computed for this proportion\. Because the confidence interval includes 0\.50, the MPXV result is interpreted descriptively as evidence of substantial label uncertainty in the audited linked\-pair set rather than as evidence that the underlying unresolved\-or\-not\-supported proportion exceeds one\-half\.
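An exact (Clopper-Pearson) interval for 41 unresolved-or-unsupported pairs out of 75 can be obtained by inverting the binomial tails. The sketch below uses bisection with the standard library; it assumes that "exact binomial" refers to the Clopper-Pearson construction.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def _root(f, target, increasing, iters=60):
    """Bisection on [0, 1] for a monotone function f."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    # Lower bound: p at which P(X >= k) = alpha/2 (increasing in p).
    lower = 0.0 if k == 0 else _root(lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, True)
    # Upper bound: p at which P(X <= k) = alpha/2 (decreasing in p).
    upper = 1.0 if k == n else _root(lambda p: binom_cdf(k, n, p), alpha / 2, False)
    return lower, upper

lo, hi = clopper_pearson(41, 75)
print(f"{41 / 75:.4f} [{lo:.4f}, {hi:.4f}]")
```

The interval straddles 0.50, which is why the result is read descriptively and not as evidence that the proportion exceeds one-half.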
The role of this module is precisely delimited: it does not validate the learned temporal prior\. It provides independent empirical evidence bearing on the reliability of transmission labels in real outbreak data, supporting the broader claim that epidemiological linkage cannot be treated as deterministic ground truth\.
Figure 5: NYC mpox inter\-host concordance classes \($n=75$ pairs\)\.
### 3\.10Guangdong Delta Transmission Graph
#### 3\.10\.1Source Data
Graph\-level structural uncertainty was assessed using the transmission visualization resource from a published study of a 2021 SARS\-CoV\-2 Delta variant outbreak in Guangdong Province, China\[[16](https://arxiv.org/html/2606.30842#bib.bib19)\], which traced 167 infections to a single index case\. The visualization distinguishes high\-confidence directed edges \(solid lines\) from uncertain edges \(dashed lines\)\. Machine\-readable graph data were extracted from the underlying asset files `cases.json` and `case_connections_raw.json`\. After normalization, the extracted graph comprised:
- 131 visualization nodes \(122 human case nodes and 9 context or location nodes\);
- 142 total directed edges \(107 high\-confidence, 35 uncertain\);
- 67 high\-confidence human case\-to\-case edges;
- 57 strict unique\-parent high\-confidence case\-to\-case edges;
- 5 child cases with multiple high\-confidence possible parents\.
#### 3\.10\.2Analytical Role and Scope Limitation
The Guangdong Delta data do not validate the learned temporal prior\. An explicit attempt was made to recover case\-level timing fields, including onset dates, exposure dates, and infection dates, sufficient to construct a temporal parent\-ranking benchmark analogous to Section [3\.5](https://arxiv.org/html/2606.30842#S3.SS5)\. No defensible timing field was identified in the publicly accessible visualization assets; node\-level date fields exhibited zero and negative gaps consistent with administrative reporting dates rather than symptom onset dates\.
The Delta dataset therefore contributes exclusively to the structural\-uncertainty analysis described in Section [3\.11](https://arxiv.org/html/2606.30842#S3.SS11)\. It provides evidence that real transmission graphs contain uncertain edges, multi\-parent ambiguity, and alternative plausible links whose omission from a strict reconstruction changes inferred outbreak structure\. All analyses using this dataset are confined to graph\-level topology; no timing\-based method evaluation is performed\.
### 3\.11Transmission Label Uncertainty and Decision\-Instability Analysis
#### 3\.11\.1Uncertainty Expansion Framework
For both the ANDV and Guangdong Delta datasets, a sequence of edge sets with increasing inclusiveness is defined:
Strict set: high\-confidence, unique\-parent edges only; corresponds to the primary benchmark used for method evaluation\.
Preferred set \(ANDV only\): strict edges augmented with lower\-confidence alternative links for which one parent is clearly preferred\.
Full plausible set \(ANDV only\): all epidemiologically plausible links, including uncertain and alternative edges\.
High\-confidence set \(Delta only\): all solid\-line directed edges regardless of unique\-parent status\.
High\-plus\-unsure set \(Delta only\): all directed human case\-to\-case edges, including uncertain dashed\-line edges\.
For each expansion, outbreak\-level statistics are recomputed: total offspring count per source case, per\-case out\-degree, source\-case ranking by offspring count, and the set of cases meeting the high\-priority threshold of at least three documented or plausible offspring\.
#### 3\.11\.2Source\-Ranking Stability Metrics
Three metrics quantify structural change between strict and uncertainty\-expanded graphs\.
##### Top\-$k$ source\-set Jaccard similarity\.
Jaccard similarity between the sets of top\-$k$ ranked source cases under strict versus expanded edge sets, ranked by offspring count\. Values below 1\.0 indicate that at least one case in the strict top\-$k$ set is absent from the expanded top\-$k$ set, or vice versa\.
##### Spearman and Kendall rank correlations\.
Rank correlations between source\-case offspring\-count rankings under strict versus expanded edge sets, computed across all source cases appearing in either set\.
##### Offspring Gini coefficient\.
The Gini coefficient of the offspring\-count distribution under each edge set, measuring the concentration of transmission burden across source cases\.
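The set-overlap and concentration metrics can be sketched in a few lines. The tie-breaking rule in `top_k` and the choice of which cases enter the Gini calculation are assumptions of this sketch; the text does not specify them.

```python
from collections import Counter

def offspring_counts(edges):
    """Offspring count per source case from (parent, child) pairs."""
    return Counter(parent for parent, _child in edges)

def top_k(counts, k):
    """Top-k source cases by offspring count; ties broken by case ID (assumption)."""
    return set(sorted(counts, key=lambda c: (-counts[c], c))[:k])

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

def gini(values):
    """Gini coefficient of a non-negative distribution (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

# Two hypothetical top-5 sets that differ in one member.
strict_top = {"A", "B", "C", "D", "E"}
expanded_top = {"A", "B", "C", "D", "F"}
print(round(jaccard(strict_top, expanded_top), 3))
```

For two sets of five, a single swapped member gives 4/6 ≈ 0.667 and two swapped members give 3/7 ≈ 0.429, so the top-5 Jaccard can take only a few discrete values.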
#### 3\.11\.3Decision\-Instability Analysis
The consequence of label uncertainty for a fixed\-capacity source\-prioritization scenario is operationalized as follows\. An outbreak response team is assumed to allocate investigation resources to at most $k=5$ source cases\. The strict\-only and uncertainty\-expanded top\-5 source sets are compared by three measures\.
##### Top\-5 Jaccard similarity\.
Jaccard similarity between strict and expanded top\-5 source sets\.
##### Decision regret\.
The proportion of uncertainty\-aware top\-5 offspring burden not captured by strict\-only prioritization:
$$\mathrm{Regret}=1-\frac{\displaystyle\sum_{i\in S_{\mathrm{strict}}\cap S_{\mathrm{expanded}}}\mathrm{offspring}_{\mathrm{expanded}}(i)}{\displaystyle\sum_{i\in S_{\mathrm{expanded}}}\mathrm{offspring}_{\mathrm{expanded}}(i)}, \tag{5}$$

where $S_{\mathrm{strict}}$ and $S_{\mathrm{expanded}}$ are the strict and uncertainty\-expanded top\-5 sets, respectively\.
This regret metric is directional: it treats the uncertainty\-expanded graph as the reference scenario and measures uncertainty\-aware offspring burden missed by strict\-only prioritization\. It is a sensitivity measure, not evidence that the expanded graph is objectively superior or that intervention outcomes would improve\.
##### Newly elevated sources\.
Cases crossing the high\-priority threshold \(at least 3 offspring\) under the expanded edge set but not under the strict set; these represent source cases that a strict\-only reconstruction would not flag for priority investigation\.
These metrics are reported separately for ANDV \(strict versus full plausible expansion\) and Guangdong Delta \(high\-confidence versus high\-plus\-unsure expansion\), and are illustrated in Figures [6](https://arxiv.org/html/2606.30842#S3.F6) and [7](https://arxiv.org/html/2606.30842#S3.F7)\.
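A worked sketch of the regret formula and the newly-elevated-source rule follows. All case identifiers and offspring counts are hypothetical; they do not reproduce the ANDV or Guangdong Delta values.

```python
def decision_regret(strict_top, expanded_top, expanded_offspring):
    """Share of expanded top-k offspring burden missed by the strict top-k set."""
    total = sum(expanded_offspring[i] for i in expanded_top)
    captured = sum(expanded_offspring[i] for i in strict_top & expanded_top)
    return 1.0 - captured / total

def newly_elevated(strict_offspring, expanded_offspring, threshold=3):
    """Cases reaching the threshold only under the expanded edge set."""
    return sorted(
        case for case, n in expanded_offspring.items()
        if n >= threshold and strict_offspring.get(case, 0) < threshold
    )

# Hypothetical offspring counts under the two edge sets.
strict_offspring = {"A": 6, "B": 5, "C": 4, "D": 2, "F": 2}
expanded_offspring = {"A": 6, "B": 5, "C": 4, "D": 3, "E": 3, "F": 2}
strict_top = {"A", "B", "C", "D", "F"}    # top-5 by strict counts
expanded_top = {"A", "B", "C", "D", "E"}  # top-5 by expanded counts

regret = decision_regret(strict_top, expanded_top, expanded_offspring)
elevated = newly_elevated(strict_offspring, expanded_offspring)
print(round(regret, 4), elevated)
```

Here the strict top-5 misses case E, whose three expanded offspring are 3 of the 21 in the expanded top-5, giving a regret of 1/7.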
Figure 6: Source structure under strict and expanded graphs\.

Figure 7: Decision regret across response capacities \($k=2$–$10$\)\.
### 3\.12Ethical Considerations and Data Availability
All data used in this study were derived from previously published outbreak investigations\. No new primary data were collected and no direct contact with study participants occurred\. Case\-level data from the Epuyén and 2014 Argentina outbreaks were extracted from published supplementary materials in which all individuals are anonymized\. NYC MPXV data were extracted from the supplementary files of a published study\[[1](https://arxiv.org/html/2606.30842#bib.bib11)\]in which informed consent procedures and institutional ethical approval are documented\. The Global\.Health audit used publicly accessible, openly licensed records\. The SVD pilot used data from a published surveillance report; no individually identifiable information was extracted or retained\.
All analyses in this preprint use data derived from previously published outbreak investigations and public resources cited in the text\. The present arXiv version reports aggregate benchmark statistics, model\-comparison results, and uncertainty analyses needed to evaluate the main claims\. Benchmark construction files, pair\-level audit tables, source\-domain diagnostic tables, and analysis scripts are retained for journal submission and peer\-review reproducibility\.
Table 2: Real\-World Evidence Modules and Their Analytical Roles

| Evidence Module | Primary Role in This Study | Directly Validates Learned Temporal Prior? | Main Question Answered | Claim Boundary |
| --- | --- | --- | --- | --- |
| ANDV: 2014 Argentina cluster \+ 2018–2019 Epuyén outbreak | Direct parent\-ranking validation benchmark | Yes | Does the locked learned temporal prior recover true parents in real person\-to\-person ANDV transmission? | Supports direct method\-performance claims |
| NYC MPXV epi–genomic linked\-pair resource | Transmission\-label uncertainty module | No | Are epidemiological inter\-host links clean direct\-transmission labels? | Supports label\-uncertainty claims only |
| Guangdong Delta visualization\-derived transmission graph | Graph\-uncertainty and decision\-instability module | No | Does preserving ambiguous/unsure graph evidence change source ranking and prioritization? | Supports uncertainty and prioritization\-instability claims only |

Table 3: Benchmark and Data Resource Summary

| Resource | Extracted Analytical Unit | Main Usable Data | Size Used in This Study | Used For |
| --- | --- | --- | --- | --- |
| ANDV combined strict benchmark | Ranked child\-parent tasks | Strict parent–child links and temporal candidate parents | 29 ranked child groups; 174 method\-level rows | Direct method validation |
| Epuyén strict benchmark | Ranked child\-parent tasks | Strict Epuyén transmission links | 27 ranked child groups; 162 method\-level rows | Sensitivity/supporting validation |
| Epuyén preferred sensitivity benchmark | Ranked child\-parent tasks | Strict \+ preferred lower\-uncertainty links | 32 ranked child groups; 192 method\-level rows | Sensitivity analysis |
| NYC MPXV linked\-pair audit | Inter\-host linked pair rows | Published phylogenomic category per linked pair | 75 inter\-host pair rows | Label\-uncertainty analysis |
| Guangdong Delta graph | Directed graph nodes and edges | High\-confidence and unsure case\-link edges | 131 nodes; 142 directed edges | Graph uncertainty and decision\-instability |
| Guangdong Delta public extraction audit | PCR and iSNV public files | 46\-subject PCR trajectories; 60 iSNV donor\-recipient pairs | 46 PCR subjects; 60 iSNV pairs | Data\-readiness audit only |

Table 4: Cross\-outbreak evidence that transmission labels are uncertain\.

| Outbreak | Strict Evidence | Expanded Evidence | Key Finding |
| --- | --- | --- | --- |
| ANDV Epuyén | 27 edges | 40 plausible edges | Thirteen additional plausible offspring links emerged, with nine source cases gaining possible offspring\. Strict\-only reconstruction therefore underestimates transmission burden\. |
| NYC MPXV | 21 genomic pairs | 31 unresolved \+ 10 unsupported | Forty\-one of 75 epidemiological links \(54\.67%\) cannot be interpreted as confirmed direct\-transmission labels, demonstrating substantial label uncertainty\. |
| Guangdong Delta | 57 edges | 72 high\-confidence \+ unsure | Fifteen additional plausible offspring links were identified and eleven parent nodes gained offspring, altering inferred source burden and transmission structure\. |

Abbreviation: MPXV = Monkeypox virus\.
Table 5: Fixed-capacity source-prioritization instability under transmission-label uncertainty.

| Dataset / Scenario | Top-$k$ | Set Jaccard | Missed Cases | Burden Coverage | Decision Regret |
| --- | --- | --- | --- | --- | --- |
| ANDV Epuyén (strict vs. full plausible) | 5 | 0.667 | P22 | 0.9286 | 0.0714 |
| Guangdong Delta (strict vs. high+unsure) | 5 | 0.429 | 5646, 5647 | 0.9643 | 0.0357 |

Note. Top-$k$ denotes the prioritization capacity. Decision regret is the fraction of uncertainty-aware offspring burden missed by strict-only prioritization.

Table 6: High-transmission source classification shifts under uncertainty expansion.

| Dataset / Scenario | Offspring Threshold | Strict Count | Expanded Count | New Sources | Source IDs |
| --- | --- | --- | --- | --- | --- |
| ANDV Epuyén (strict vs. full plausible) | $\geq 3$ | 3 | 5 | 2 | P10, P22 |
| Guangdong Delta (strict vs. high+unsure) | $\geq 3$ | 7 | 8 | 1 | 5646 |

Note. "Expanded" denotes the uncertainty-aware edge set. Newly elevated sources exceed the high-transmission threshold only after uncertainty-expanded links are retained.
## 4Results
### 4\.1Study Overview and Evidence Roles
The study evaluated two related but distinct questions: whether a locked learned temporal transmission prior can improve candidate\-parent ranking on a real outbreak benchmark without target\-specific refitting, and whether transmission\-label uncertainty in real outbreak data changes downstream reconstruction and prioritization conclusions\. The analytical modules served distinct roles, summarized in Table[2](https://arxiv.org/html/2606.30842#S3.T2)\. The D1 multi\-disease benchmark provides cross\-disease generalization evidence for the learned prior\. The ANDV benchmark is the sole direct external validation of prior performance on a real outbreak\. The Global\.Health audit characterizes benchmark readiness in large public repositories\. The SVD pilot provides exploratory relative\-time ranking feasibility evidence\. The MPXV and Guangdong Delta modules assess transmission\-label reliability and graph\-level structural uncertainty, respectively; neither validates the learned prior\.
Algorithm 2: Locked temporal-prior transfer evaluation

**Input:** Source training benchmark $\mathcal{B}_{\mathrm{src}}$; target evaluation benchmark $\mathcal{B}_{\mathrm{tgt}}$; temporal feature map $\phi(i,j)$; ranking metrics $\mathcal{M}$.

**Output:** Target-benchmark ranking metrics for the locked learned temporal prior and all source-trained temporal baselines.

**Phase 1: Source-only prior estimation**

- Construct source training pairs $\mathcal{D}_{\mathrm{src}}=\{(\phi(i,j),y_{ij})\}$ from $\mathcal{B}_{\mathrm{src}}$, where $y_{ij}=1$ if $i$ is the documented parent of $j$ and $y_{ij}=0$ otherwise.
- Fit the learned temporal model on $\mathcal{D}_{\mathrm{src}}$ only, yielding parameter vector $\hat{\theta}$.
- Estimate all baseline temporal scoring functions from source training gaps only: Gaussian, KDE, Gamma, and Lognormal.
- Lock $\hat{\theta}$ and all baseline parameters.
- Do not update, refit, calibrate, or tune any temporal scoring function using $\mathcal{B}_{\mathrm{tgt}}$.

**Phase 2: Target evaluation under locked scoring**

- **for each** target ranking task $(j,\mathcal{C}_{j},p_{j})\in\mathcal{B}_{\mathrm{tgt}}$ **do**
  - **for each** candidate parent $i\in\mathcal{C}_{j}$ **do**
    - Compute temporal feature vector $\phi(i,j)$.
    - Score the candidate using the locked learned prior: $s_{L}(i,j)=\sigma(\hat{\theta}^{\top}\phi(i,j))$.
    - Score the same candidate using each locked baseline temporal function: $s_{b}(i,j)=S_{b}(d(j)-d(i)),\quad b\in\{\mathrm{Gaussian},\mathrm{KDE},\mathrm{Gamma},\mathrm{Lognormal}\}$.
  - For each method $m$, rank candidates in $\mathcal{C}_{j}$ by $s_{m}(i,j)$ in descending order.
  - Let $r_{j}^{(m)}$ be the rank assigned to the strict true parent $p_{j}$.
  - Compute child-level ranking metrics $\mathrm{MRR}_{j}^{(m)}=\frac{1}{r_{j}^{(m)}}$, $\mathrm{Hit@}k_{j}^{(m)}=\mathbb{I}\{r_{j}^{(m)}\leq k\}$, and the corresponding NDCG contribution.
- Aggregate child-level metrics over all tasks in $\mathcal{B}_{\mathrm{tgt}}$.
- Estimate uncertainty using child-level bootstrap resampling and paired method comparisons.
- **return** target-benchmark metrics for the learned prior and all baselines.
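Phase 2 of Algorithm 2 can be illustrated with a minimal Python sketch. It is not the archived implementation: the two-dimensional feature map, the parameter values in `THETA_LOCKED`, the onset days, and the pessimistic tie-breaking rule are all hypothetical choices made for illustration, and the NDCG term assumes the usual single-relevant-item form $1/\log_2(1+r)$, which this section does not spell out.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def locked_score(theta, features):
    """s_L(i, j) = sigma(theta^T phi(i, j)); theta is read-only here."""
    return sigmoid(sum(t * f for t, f in zip(theta, features)))

def true_parent_rank(scores, parent):
    """1-based rank of the true parent under descending score.
    Ties count against the true parent (an assumption of this sketch)."""
    s_true = scores[parent]
    return 1 + sum(1 for c, s in scores.items() if c != parent and s >= s_true)

def evaluate(tasks, theta, feature_map, k=3):
    """Average child-level reciprocal rank, Hit@k, and NDCG over all tasks."""
    rows = []
    for child, candidates, parent in tasks:
        scores = {i: locked_score(theta, feature_map(i, child)) for i in candidates}
        r = true_parent_rank(scores, parent)
        rows.append({
            "mrr": 1.0 / r,
            "hit_at_k": float(r <= k),
            "ndcg": 1.0 / math.log2(r + 1),  # assumed single-relevant-item NDCG
        })
    return {name: sum(row[name] for row in rows) / len(rows) for name in rows[0]}

# Hypothetical toy data: onset day per case and a two-feature map.
ONSET_DAY = {"a": 0, "b": 15, "c": 28, "d": 30, "e": 40}

def toy_feature_map(i, j):
    gap = ONSET_DAY[j] - ONSET_DAY[i]
    return (1.0, -abs(gap - 14) / 14.0)

THETA_LOCKED = (0.0, 1.0)  # hypothetical frozen parameters
tasks = [("d", ["a", "b", "c"], "b"), ("e", ["b", "c", "d"], "d")]
result = evaluate(tasks, THETA_LOCKED, toy_feature_map, k=1)
```

In this toy run the true parent is ranked first for child `d` and second for child `e`, so the aggregate MRR is 0.75 and Hit@1 is 0.5. The point of the design is that `evaluate` only reads `theta`; nothing in the target loop can modify it.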
### 4\.2Cross\-Disease Transfer on the D1 Benchmark
Under the leave\-one\-disease\-out \(LODO\) evaluation design described in Section[3\.2\.3](https://arxiv.org/html/2606.30842#S3.SS2.SSS3), the locked learned prior was evaluated across all 11 D1 disease folds\. Relative to a within\-fold median\-gap baseline, the learned prior achieved higher MRR in 8 of 11 folds, tied in 2 folds, and underperformed in 1 fold\. Because the median\-gap rule is a simple heuristic rather than the strongest fair likelihood comparator, this fold\-count result is treated as descriptive rather than as the primary comparative evidence\.
The primary fair comparison was against the source-trained Gaussian likelihood baseline. In this comparison, the learned prior achieved a disease-macro MRR of 0.57495, compared with 0.45225 for the Gaussian likelihood baseline, corresponding to $\Delta\mathrm{MRR}=+0.12270$ (bootstrap 95% CI $[+0.02978, +0.25395]$). At the fold level, the learned prior outperformed the Gaussian likelihood baseline in 7 of 11 folds, tied in 3 folds, and underperformed in 1 fold. Treating tied folds as non-discordant, an exploratory one-sided exact sign test for the fold-level learned-prior advantage over the Gaussian likelihood baseline yielded $p=0.0352$, based on 7 wins and 1 loss. These results indicate that the locked temporal prior provides a measurable cross-disease ranking advantage over a fair source-trained temporal likelihood comparator.
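The sign-test arithmetic can be checked directly. The sketch below assumes the test is the upper binomial tail with success probability 0.5 over the non-tied folds; the function name is illustrative.

```python
from math import comb

def one_sided_sign_test(wins, losses):
    """P(X >= wins) for X ~ Binomial(wins + losses, 0.5); ties are dropped beforehand."""
    n = wins + losses
    return sum(comb(n, x) for x in range(wins, n + 1)) / 2 ** n

p_d1_folds = one_sided_sign_test(7, 1)   # 7 fold wins, 1 fold loss
p_top1_disc = one_sided_sign_test(8, 1)  # 8 vs 1 discordant Top-1 tasks
```

With 7 wins and 1 loss the tail is $9/256 \approx 0.0352$, which matches the value reported above. Under the same one-sided convention, 8 versus 1 discordant tasks gives $10/512 \approx 0.019531$, the figure reported for the Top-1 discordance test against the Gaussian baseline in Section 4.6.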
Prior shape was near\-invariant across LODO folds\. Across all 55 pairwise comparisons of fold\-trained plausibility curves, the mean Pearson correlation was 0\.9917 \(median 0\.9950\), and the mean Spearman correlation was 0\.9694 \(median 0\.9838\)\. This stability suggests that the learned temporal structure is not dominated by idiosyncratic disease\-specific serial\-interval patterns, but instead reflects a reproducible timing signal across the D1 benchmark\.
Comparison with higher-capacity listwise rankers further indicated that the observed performance was not driven by architectural complexity. The listwise MLP achieved a disease-macro MRR of 0.57261, and the listwise linear ranker achieved 0.56402, both below the logistic-regression prior value of 0.57495. Few-shot target-specific adaptation also did not materially improve performance, producing a best mean $\Delta\mathrm{MRR}$ of only $+0.00185$ (bootstrap 95% CI $[-0.00189, +0.00774]$). Together, these analyses support the use of the zero-shot locked prior rather than target-specific refitting on small held-out disease samples.
### 4\.3Benchmark Readiness of the Global\.Health Repository
Application of Algorithm [1](https://arxiv.org/html/2606.30842#algorithm1) to the Global.Health corpus showed that attrition occurred before model evaluation rather than during scoring. Among 134,095 audited records, only 146 contained onset-date information, 53 contained a `Contact_ID` field, 105 contained a `Contact_with_case` field, and 77 contained a transmission field (Table [7](https://arxiv.org/html/2606.30842#S4.T7)). The parsing procedure extracted 53 candidate contact edges, of which 26 could be matched back to case identifiers in the same source files and satisfied the strict unique-parent and high-confidence criteria. However, zero ranked child-reconstruction tasks could be constructed under the verified onset-date requirement and the pre-specified candidate-parent eligibility window ($w_{\min}=1$ day, $w_{\max}=60$ days). Although 11 proxy relative-time tasks could be constructed using fallback dates, none satisfied the strict verified-onset requirement used for the primary benchmark definition. Thus, the zero-task result reflects sparse benchmark-ready transmission metadata rather than failure of the ranking model. Large public line-list repositories remain valuable for incidence tracking and descriptive surveillance, but they do not currently contain sufficient structured transmission information to support strict parent-ranking benchmark construction at the scale assumed by automated pipelines. This finding motivates the use of manually curated, source-verified outbreak records for external validation, as described in Sections [3.5](https://arxiv.org/html/2606.30842#S3.SS5) and [3.7](https://arxiv.org/html/2606.30842#S3.SS7).
Table 7: Global.Health audit attrition from raw records to benchmark-ready transmission tasks.

| Quantity | Count |
| --- | --- |
| Total records audited | 134,095 |
| Records with onset date | 146 |
| Records with `Contact_ID` | 53 |
| Records with `Contact_with_case` | 105 |
| Records with transmission field | 77 |
| Extracted contact edges | 53 |
| Recoverable matched edges | 26 |
| Proxy temporal tasks | 11 |
| Strict benchmark tasks | 0 |

Note. Strict benchmark tasks require verified parent and child onset dates under Algorithm [1](https://arxiv.org/html/2606.30842#algorithm1). Proxy temporal tasks use fallback relative-time information and are therefore not treated as strict validation tasks.
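To make the strict-task requirement concrete, the sketch below shows one way the eligibility window and verified-onset rule could interact. It is an illustration under stated assumptions, not Algorithm 1 itself: the function names, the use of `None` for an unverified onset date, and the toy records are hypothetical; only the window bounds of 1 and 60 days come from the text.

```python
from datetime import date

def eligible_parents(child_onset, onsets, w_min=1, w_max=60):
    """Cases whose verified onset precedes the child's by w_min..w_max days."""
    return sorted(
        case_id
        for case_id, onset in onsets.items()
        if onset is not None and w_min <= (child_onset - onset).days <= w_max
    )

def build_strict_task(child, parent, onsets, w_min=1, w_max=60):
    """Return the candidate set, or None when no strict task can be built."""
    child_onset = onsets.get(child)
    if child_onset is None or onsets.get(parent) is None:
        return None  # a missing verified onset date blocks the task
    candidates = eligible_parents(child_onset, onsets, w_min, w_max)
    return candidates if parent in candidates else None

# Hypothetical line-list fragment; None marks a missing onset date.
onsets = {
    "c1": date(2020, 1, 1),
    "c2": date(2020, 1, 10),
    "c3": None,
    "c4": date(2019, 10, 1),
    "c5": date(2020, 2, 1),
}
task_ok = build_strict_task("c5", "c1", onsets)
task_missing_onset = build_strict_task("c5", "c3", onsets)
task_out_of_window = build_strict_task("c5", "c4", onsets)
```

Under these assumptions a contact edge survives extraction but yields no ranked task whenever either onset date is missing or the documented parent falls outside the window, which is the attrition pattern Table 7 records.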
### 4\.4The Real ANDV Benchmark as a Strict External Evaluation Task
The primary real\-outbreak evaluation combined ANDV transmission evidence from the 2014 Argentina cluster and the 2018–2019 Epuyén outbreak, as described in Section[3\.5](https://arxiv.org/html/2606.30842#S3.SS5)\. The combined strict benchmark comprised 29 ranked child\-reconstruction tasks, producing 174 method\-level evaluation rows across six evaluated methods\. Of these 29 tasks, 27 contained at least two temporally eligible candidate parents, while two were singleton candidate sets retained because the documented parent was the only eligible predecessor\. The Epuyén\-only strict benchmark comprised 27 tasks and 162 method\-level rows; the preferred\-sensitivity benchmark comprised 32 tasks and 192 rows\.
Uncertain and alternative transmission links were excluded from the primary benchmark and retained for the structural\-uncertainty analysis in Section[3\.11](https://arxiv.org/html/2606.30842#S3.SS11)\. This separation ensures that primary performance claims rest on high\-confidence evidence only\.
### 4\.5The Learned Prior Improved Real ANDV Parent\-Ranking
On the primary combined strict ANDV benchmark, the locked learned prior achieved the highest MRR and Top\-1 accuracy among all fair source\-trained temporal methods\. Full results are reported in Table[8](https://arxiv.org/html/2606.30842#S4.T8)and illustrated in Figure[3](https://arxiv.org/html/2606.30842#S3.F3)\.
The primary 29\-task metrics include two singleton candidate sets in which all methods receive deterministic reciprocal\-rank contributions of 1\.0\. These tasks are retained for completeness because the documented parent is the only temporally eligible predecessor, but they do not contribute to between\-method discrimination; the nontrivial 27\-task subset is evaluated separately in Section[4\.9](https://arxiv.org/html/2606.30842#S4.SS9)\.
The locked learned prior obtained Top\-1 accuracy of 0\.379 \(bootstrap 95% CI \[0\.207, 0\.552\]\), Top\-3 accuracy of 0\.759 \(\[0\.586, 0\.897\]\), MRR of 0\.571 \(\[0\.444, 0\.701\]; Equation[1](https://arxiv.org/html/2606.30842#S3.E1)\), NDCG of 0\.673 \(\[0\.575, 0\.773\]\), and mean true\-parent rank of 3\.379 \(\[2\.172, 5\.000\]\) with median rank 2\.0\.
The four fair source\-trained baselines performed substantially worse\. The Gaussian baseline achieved Top\-1 accuracy of 0\.138, Top\-3 accuracy of 0\.207, MRR of 0\.274, NDCG of 0\.432, and mean true\-parent rank of 7\.724\. The KDE, Gamma, and Lognormal baselines each achieved Top\-1 accuracy of 0\.103, Top\-3 accuracy of 0\.172, MRR values in the range 0\.236 to 0\.237, NDCG values of 0\.401, and mean true\-parent ranks of 8\.345 to 8\.379\. The learned prior more than doubled the MRR of the best fair generic temporal baseline and advanced median true\-parent rank from position 7 or 8 to position 2\.
The Epuyén serial\-interval reference achieved Top\-1 accuracy of 0\.345, Top\-3 accuracy of 0\.724, MRR of 0\.555, NDCG of 0\.662, and mean true\-parent rank of 3\.310\. As established in Section[3\.4](https://arxiv.org/html/2606.30842#S3.SS4), this comparator is disease\-contextual and does not constitute a fair source\-trained baseline\. The primary comparative conclusion is that the learned prior strongly outperformed all four fair source\-trained parametric temporal baselines and performed competitively with the outbreak\-contextual serial\-interval reference\.
Table 8: Primary ANDV parent-ranking performance on the combined strict benchmark ($n=29$ tasks). Bracketed values denote bootstrap 95% confidence intervals from 10,000 resamples.

| Method | $n$ | Top-1 (95% CI) | Top-3 (95% CI) | MRR (95% CI) | NDCG (95% CI) | Mean Rank (95% CI) | Median Rank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Locked learned prior | 29 | 0.379 [0.207, 0.552] | 0.759 [0.586, 0.897] | 0.571 [0.444, 0.701] | 0.673 [0.575, 0.773] | 3.379 [2.172, 5.000] | 2.0 |
| Epuyén SI reference | 29 | 0.345 [0.172, 0.517] | 0.724 [0.552, 0.862] | 0.555 [0.432, 0.680] | 0.662 [0.566, 0.757] | 3.310 [2.172, 4.897] | 2.0 |
| Gaussian | 29 | 0.138 [0.034, 0.276] | 0.207 [0.069, 0.345] | 0.274 [0.172, 0.393] | 0.432 [0.350, 0.526] | 7.724 [5.828, 9.690] | 7.0 |
| KDE | 29 | 0.103 [0.000, 0.241] | 0.172 [0.034, 0.310] | 0.236 [0.147, 0.347] | 0.401 [0.328, 0.489] | 8.379 [6.517, 10.310] | 8.0 |
| Gamma | 29 | 0.103 [0.000, 0.241] | 0.172 [0.034, 0.310] | 0.237 [0.147, 0.347] | 0.401 [0.329, 0.490] | 8.345 [6.483, 10.276] | 8.0 |
| Lognormal | 29 | 0.103 [0.000, 0.241] | 0.172 [0.034, 0.310] | 0.236 [0.147, 0.347] | 0.401 [0.328, 0.489] | 8.379 [6.517, 10.310] | 8.0 |

Note. The Epuyén serial-interval reference is a contextual comparator and not a fair source-trained baseline. MRR = mean reciprocal rank; NDCG = normalized discounted cumulative gain.
### 4\.6Paired Statistical Tests Confirmed Robust Gains Over Fair Baselines
Paired task\-level comparisons, described in Section[3\.8](https://arxiv.org/html/2606.30842#S3.SS8)and summarized in Table[9](https://arxiv.org/html/2606.30842#S4.T9), confirmed that the learned prior improved true\-parent ranking relative to every fair source\-trained baseline\.
Against the Gaussian baseline, the learned prior improved Top-1 accuracy by 0.241 ([0.069, 0.414]), MRR by 0.297 ([0.156, 0.430]; Equation [4](https://arxiv.org/html/2606.30842#S3.E4)), and mean true-parent rank by 4.345 positions ([2.690, 6.069]). The learned prior was Top-1 correct when the Gaussian baseline was not in 8 tasks; the Gaussian baseline was Top-1 correct when the learned prior was not in 1 task. The exact Top-1 discordance test yielded $p=0.019531$, the MRR sign-flip permutation test yielded $p=0.000200$, and the rank Wilcoxon test yielded $p=0.000070$.
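A paired sign-flip permutation test on task-level reciprocal-rank differences can be sketched as follows. This is a Monte Carlo illustration, not the authors' code: the number of permutations, the seed, the one-sided direction, and the add-one correction are assumptions, and the input vectors are hypothetical.

```python
import random

def sign_flip_pvalue(diffs, n_perm=2000, seed=0):
    """One-sided Monte Carlo sign-flip test of mean(diffs) > 0.

    Under the null the sign of each paired difference is exchangeable, so
    every difference is negated with probability 0.5 in each permutation.
    """
    rng = random.Random(seed)
    n = len(diffs)
    observed = sum(diffs) / n
    hits = 0
    for _ in range(n_perm):
        stat = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if stat >= observed - 1e-12:  # tolerance guards against float round-off
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction (assumption)

# Hypothetical per-task differences (learned prior minus comparator).
p_consistent_gain = sign_flip_pvalue([0.5] * 10)
p_no_gain = sign_flip_pvalue([1.0, -1.0, 1.0, -1.0])
```

Ten consistently positive differences give a small p-value, whereas a balanced set of gains and losses gives one well above 0.5. With the add-one correction the smallest attainable p-value is $1/(n_{\mathrm{perm}}+1)$, so very small reported values depend on how many permutations were drawn.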
Advantages were larger against the KDE, Gamma, and Lognormal baselines. Against KDE, the learned prior improved MRR by 0.335 ([0.228, 0.447]) and mean true-parent rank by 5.000 positions ([3.552, 6.517]), with 25 rank wins, 4 ties, and 0 losses. Against Gamma, the corresponding improvements were $\Delta\mathrm{MRR}=0.334$ ([0.227, 0.447]) and mean rank improvement of 4.966 positions ([3.517, 6.517]), with 25 wins, 4 ties, and 0 losses. Against Lognormal, $\Delta\mathrm{MRR}=0.335$ ([0.228, 0.447]) and mean rank improvement of 5.000 positions ([3.552, 6.517]), with 25 wins, 4 ties, and 0 losses.
The difference between the learned prior and the Epuyén serial-interval reference was small and not statistically significant. The learned prior exceeded the reference by 0.034 in Top-1 accuracy and 0.016 in MRR; confidence intervals for both differences included zero and all paired tests were non-significant ($p>0.40$). This is consistent with the interpretation in Section [3.4](https://arxiv.org/html/2606.30842#S3.SS4): the reference is calibrated to the target disease and does not constitute a fair comparator.
Table 9: Paired comparison of the locked learned prior against temporal comparators on the primary combined strict ANDV benchmark ($n=29$ tasks). Positive values favor the learned prior.

| Comparator | Type | $\Delta$Top-1 (95% CI) | $\Delta$MRR (95% CI) | Rank Imp. (95% CI) | L Only | C Only | Top-1 $p$ | MRR $p$ | Rank $p$ | W/T/L |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gaussian | Fair | 0.241 [0.069, 0.414] | 0.297 [0.156, 0.430] | 4.345 [2.690, 6.069] | 8 | 1 | 0.0195 | 0.0002 | 0.00007 | 23/5/1 |
| KDE | Fair | 0.276 [0.138, 0.448] | 0.335 [0.228, 0.447] | 5.000 [3.552, 6.517] | 8 | 0 | 0.0039 | 0.00002 | 0.000006 | 25/4/0 |
| Gamma | Fair | 0.276 [0.138, 0.448] | 0.334 [0.227, 0.447] | 4.966 [3.517, 6.517] | 8 | 0 | 0.0039 | 0.00002 | 0.000006 | 25/4/0 |
| Lognormal | Fair | 0.276 [0.138, 0.448] | 0.335 [0.228, 0.447] | 5.000 [3.552, 6.517] | 8 | 0 | 0.0039 | 0.00002 | 0.000006 | 25/4/0 |
| Epuyén SI | Context | 0.034 [−0.138, 0.207] | 0.016 [−0.100, 0.130] | −0.069 [−0.586, 0.414] | 4 | 3 | 0.5000 | 0.4032 | 0.6355 | 8/14/7 |
### 4\.7Finite\-Sample Robustness of the ANDV Conclusions
The compact size of the strict ANDV benchmark \(29 tasks\) was addressed through four complementary robustness analyses, described in Section[3\.8\.2](https://arxiv.org/html/2606.30842#S3.SS8.SSS2)and illustrated in Figure[4](https://arxiv.org/html/2606.30842#S3.F4)\. Results are summarized in Table[10](https://arxiv.org/html/2606.30842#S4.T10)\.
The per\-task MRR distribution for the locked learned prior had mean 0\.5709 and standard deviation 0\.3609\. The bootstrap 95% confidence interval for learned\-prior MRR was \[0\.4426, 0\.6986\]; the corresponding interval for Top\-1 accuracy was \[0\.2069, 0\.5517\]\.
Leave-one-task-out influence analysis confirmed that no individual task drove the headline result: the jackknife standard error of learned-prior MRR was 0.0670 and the maximum absolute leave-one-task-out MRR shift was 0.0187. Paired bootstrap $\Delta$MRR intervals were strictly positive against all four fair baselines: $+0.297$ ([0.156, 0.430]) versus Gaussian, $+0.335$ ([0.228, 0.447]) versus KDE, $+0.334$ ([0.227, 0.447]) versus Gamma, and $+0.335$ ([0.228, 0.447]) versus Lognormal.
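The leave-one-task-out diagnostics can be sketched as below, assuming the standard delete-one jackknife applied to the mean of per-task reciprocal ranks. The input vector is hypothetical and the function name is illustrative.

```python
from math import sqrt

def jackknife_mrr(reciprocal_ranks):
    """Full-sample MRR, jackknife standard error, and maximum absolute
    leave-one-task-out shift in MRR."""
    n = len(reciprocal_ranks)
    total = sum(reciprocal_ranks)
    full = total / n
    # MRR recomputed with each task removed in turn.
    loo = [(total - x) / (n - 1) for x in reciprocal_ranks]
    loo_mean = sum(loo) / n
    se = sqrt((n - 1) / n * sum((v - loo_mean) ** 2 for v in loo))
    max_shift = max(abs(v - full) for v in loo)
    return full, se, max_shift

# Hypothetical per-task reciprocal ranks.
mrr, se, max_shift = jackknife_mrr([1.0, 0.5, 0.25, 1.0])
```

For a plain mean, the jackknife standard error reduces to the sample standard deviation divided by $\sqrt{n}$. The reported values are consistent with that identity: $0.3609/\sqrt{29}\approx 0.0670$.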
The paired task\-reversal robustness index showed that 7 currently learned\-better task outcomes would need to reverse to Gaussian\-better outcomes before one\-sided sign\-test significance was lost, and 8 reversals would be required against each of Gamma, KDE, and Lognormal\. The corresponding reversal counts for the Top\-1 discordance test were 1 against Gaussian and 2 against each of Gamma, KDE, and Lognormal\. Top\-1 is therefore retained as a descriptive summary statistic; all robustness claims are grounded in the rank\-sensitive MRR and true\-parent\-rank metrics\.
Table 10: Finite-sample robustness diagnostics, strict ANDV benchmark ($n=29$ tasks).

| Diagnostic | Value |
| --- | --- |
| Ranked child tasks | 29 |
| Learned-prior MRR mean | 0.5709 |
| Learned-prior per-task MRR SD | 0.3609 |
| Learned-prior MRR bootstrap 95% CI | [0.4426, 0.6986] |
| Learned-prior Top-1 bootstrap 95% CI | [0.2069, 0.5517] |
| Learned-prior MRR jackknife SE | 0.0670 |
| Maximum absolute leave-one-task-out MRR shift | 0.0187 |
| MRR reversals required to lose sign-test significance vs Gaussian | 7 |
| MRR reversals required vs Gamma, KDE, Lognormal | 8 |
| Top-1 reversals required vs Gaussian | 1 |
| Top-1 reversals required vs Gamma, KDE, Lognormal | 2 |
### 4\.8SVD Pilot Evaluation
The pilot evaluation on the reconstructed SVD transmission network assessed whether temporal gap\-based parent ranking is feasible under the relative\-time benchmark construction described in Section[3\.7\.1](https://arxiv.org/html/2606.30842#S3.SS7.SSS1)\. Because absolute symptom\-onset dates were not uniformly recoverable, this analysis was not treated as a strict external validation of the locked learned temporal prior\. Instead, simple temporal ranking heuristics were evaluated on the reconstructed relative\-time benchmark: a proximity\-to\-reference heuristic centered on a biologically plausible 14\-day parent–child gap, additional proximity heuristics centered on shorter temporal gaps, and a shortest\-available\-gap heuristic as a comparison baseline\.
The reconstructed SVD benchmark contained 9 valid ranking tasks and 57 candidate-parent rows, corresponding to a mean candidate-set size of 6.33 candidates per task (range: 2–13). Under random ranking, the expected MRR was 0.462, computed as the task-level mean of $H_{k}/k$, where $k$ denotes candidate-set size and $H_{k}$ is the $k$th harmonic number. The 14-day proximity rule achieved the strongest performance: Top-1 accuracy of 0.889 (bootstrap 95% CI: 0.667–1.000), Top-3 accuracy of 0.889, MRR of 0.898 (95% CI: 0.694–1.000), NDCG of 0.919, and mean true-parent rank of 2.22, corresponding to an approximately 1.94-fold MRR improvement over random ranking. Proximity heuristics centered at 7, 8, and 10 days each achieved Top-1 accuracy of 0.778 and MRR of 0.798. The shortest-gap heuristic performed substantially worse: Top-1 accuracy of 0.222 and MRR of 0.368 (95% CI: 0.169–0.631), falling below the random-ranking expectation. Leave-one-task-out sensitivity analysis of the 14-day heuristic showed reasonable stability, with a maximum absolute MRR change of 0.102 on removal of any individual task.
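The random-ranking reference follows from a short calculation: if the true parent is equally likely to occupy any of the $k$ positions, the expected reciprocal rank is $\frac{1}{k}\sum_{r=1}^{k}\frac{1}{r}=H_{k}/k$. A minimal sketch is given below; the candidate-set sizes in the example are hypothetical, since the nine SVD set sizes are not listed in this section.

```python
from fractions import Fraction

def expected_rr_random(k):
    """E[1/rank] when the true parent is uniformly placed among k candidates: H_k / k."""
    return sum(Fraction(1, r) for r in range(1, k + 1)) / k

def expected_mrr_random(candidate_set_sizes):
    """Task-level mean of H_k / k over the benchmark's candidate-set sizes."""
    terms = [expected_rr_random(k) for k in candidate_set_sizes]
    return float(sum(terms) / len(terms))

# Hypothetical candidate-set sizes for illustration only.
baseline = expected_mrr_random([2, 2])
```

A singleton candidate set contributes exactly 1 to this baseline and a two-candidate set contributes 0.75, so benchmarks with many small candidate sets have a high random-ranking floor. This is why observed MRR is best read against the size-specific expectation rather than against zero.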
These findings confirm that temporal ordering contains discriminative information in the reconstructed SVD network under relative\-time benchmark conditions and that proximity to a biologically plausible parent–child interval substantially outperforms a naive shortest\-gap strategy\. This experiment is not an external validation of the locked learned temporal prior\. The evaluation used temporal heuristics rather than the archived serialized prior artifact, comprised only nine valid ranking tasks, and depended on reconstructed relative timing rather than uniformly verified absolute symptom\-onset dates\. The SVD analysis provides exploratory feasibility evidence for relative\-time temporal ranking only; it should not be interpreted as an additional cross\-disease validation of the temporal\-prior transfer hypothesis\.
### 4\.9Candidate\-Window and Source\-Domain Sensitivity of the ANDV Benchmark
Window-sensitivity analyses were conducted to determine whether the ANDV result depended on the pre-specified 1–60 day candidate-eligibility window. The analysis was repeated across nested windows of 1–30, 1–45, 1–60, 4–45, 4–60, 7–45, and 7–60 days. Under the full strict benchmark, the locked learned prior remained the strongest fair timing-only method across all evaluated windows, with MRR ranging from 0.571 to 0.607. The advantage over the Gaussian source-trained baseline remained positive in every window, with $\Delta\mathrm{MRR}$ ranging from $+0.189$ to $+0.327$. Because two strict tasks contained only one temporally eligible predecessor, the analysis was repeated on the nontrivial multi-candidate subset. For the reference 1–60 day window, this subset is exactly the primary 29-task strict benchmark after excluding the two singleton candidate sets; no additional tasks are added or removed. In this 27-task subset, the learned prior again remained strongest (MRR = 0.539 versus 0.220 for the Gaussian baseline, using the same metric definition as in the primary benchmark). Across all evaluated windows, the learned-prior advantage over the source-trained Gaussian baseline remained positive. These results indicate that the learned-prior advantage is not an artifact of the exact 1–60 day eligibility-window choice.
Using the recovered source-only protocol, full-D1 retraining reproduced the archived locked-prior ANDV result to the reported precision, yielding $\mathrm{MRR}=0.5707$ and Top-1 accuracy of $0.3793$. Removing the orthopox-coded/smallpox group from D1 did not attenuate ANDV performance: the retrained model achieved $\mathrm{MRR}=0.5951$ and Top-1 accuracy of $0.4138$ on the unchanged 29-task strict ANDV benchmark. This weakens the alternative explanation that the ANDV result is driven by incidental orthopox-coded training exposure.
A source-domain influence diagnostic identified MERS as the most influential source group. Removing MERS reduced ANDV performance to $\mathrm{MRR}=0.4620$ ($\Delta\mathrm{MRR}=-0.1087$), but the MERS-removed model remained above both the random-ranking expectation computed from ANDV candidate-set sizes ($\mathrm{MRR}=0.3429$) and the fair Gaussian source-trained baseline from the primary analysis ($\mathrm{MRR}=0.274$). Removing MERS and the orthopox-coded group together produced the same $\mathrm{MRR}=0.4620$, indicating that the attenuation was attributable to MERS rather than to orthopox-coded examples. Thus, MERS contributes useful source-domain timing information, but the ANDV result does not collapse to baseline levels when this source group is excluded.
As an exploratory negative control, D1 parent labels were randomly permuted 2,000 times while preserving the same feature matrix, candidate-set geometry, and training protocol. The shuffled-label controls achieved mean ANDV $\mathrm{MRR}=0.3270$, with a 95% empirical range of 0.1969–0.5888; the archived locked-prior MRR exceeded this null with empirical $p=0.0475$. Because the upper empirical range overlaps the observed value, this result is interpreted as exploratory negative-control evidence rather than as a standalone confirmatory test.
### 4\.10MPXV Evidence: Epidemiological Links Are Not Clean Transmission Labels
The MPXV label\-uncertainty module \(Section[3\.9](https://arxiv.org/html/2606.30842#S3.SS9)\) assessed whether epidemiologically linked inter\-host pairs can be treated as clean direct\-transmission ground truth\. Among 75 inter\-host linked pair rows, 21 were classified as strict genomically supported \(monophyletic\), 13 as potentially supported \(shared ancestor\), 31 as unresolved \(inconclusive\), and 10 as not supported \(not\-linked\), as shown in Table[11](https://arxiv.org/html/2606.30842#S4.T11)and Figure[5](https://arxiv.org/html/2606.30842#S3.F5)\.
Combining the *unresolved* and *not-supported* categories, 41 of 75 pairs (54.67%; exact 95% CI: 42.75–66.21%) could not be treated as clean direct-transmission labels under a naive epidemiological-linkage assumption. Conversely, 34 of 75 pairs (45.33%) received strict or potential broad genomic support. Because the exact confidence interval includes $0.50$, we interpret this result as evidence of substantial label uncertainty in the audited linked-pair set, rather than as statistical evidence that the unresolved-or-not-supported proportion exceeds one-half in the underlying outbreak process.
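The interval can be approximately reproduced with a standard-library sketch, assuming that "exact" refers to the Clopper–Pearson construction; this section does not name the method. The bounds are found by bisection on binomial tail probabilities rather than through a statistics package.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

def _solve_increasing(f, target, iters=100):
    """Bisection for f(p) = target on [0, 1], where f is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Two-sided exact (Clopper-Pearson) interval for a binomial proportion."""
    # Lower bound: P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2, i.e. P(X > k | p) = 1 - alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper

lower, upper = clopper_pearson(41, 75)
```

Because the lower bound falls below 0.50, the data do not exclude the possibility that fewer than half of such links are unresolved or unsupported, which is the reading adopted above.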
This module does not validate the learned temporal prior\. It provides empirical evidence bearing on the reliability of transmission labels in published outbreak data and supports the premise that epidemiological linkage should not be treated as deterministic direct\-transmission ground truth when constructing or evaluating reconstruction benchmarks\.
Table 11: NYC MPXV inter-host linked-pair evidence classes ($n=75$ pairs). See Section [3.9](https://arxiv.org/html/2606.30842#S3.SS9) for codebook definitions. The two summary rows report aggregated proportions relevant to the primary analysis.

| Evidence class | Count | Fraction of 75 pairs |
| --- | --- | --- |
| Strict genomically supported (monophyletic) | 21 | 28.00% |
| Potentially supported (shared ancestor) | 13 | 17.33% |
| Unresolved (inconclusive) | 31 | 41.33% |
| Not genomically supported (not-linked) | 10 | 13.33% |
| Unresolved or not supported | 41 | 54.67% |
| Strict or potential broad support | 34 | 45.33% |
### 4\.11Guangdong Delta Graph: Structural Uncertainty in a Real Transmission Network
The Guangdong Delta transmission graph \(Section[3\.10](https://arxiv.org/html/2606.30842#S3.SS10)\) was used to characterize graph\-level structural uncertainty\. The extracted visualization graph contained 131 nodes \(122 human case nodes and 9 context or location nodes\) and 142 directed edges \(107 high\-confidence and 35 uncertain\), as reported in Table[12](https://arxiv.org/html/2606.30842#S4.T12)\. Among human case\-to\-case directed edges, 67 were high\-confidence and 57 met the strict unique\-parent criterion\. Five child cases had multiple high\-confidence possible parents, constituting explicit multi\-parent ambiguity within the published reconstruction\.
No defensible case\-level timing field was recoverable from the public visualization assets, precluding construction of a temporal parent\-ranking benchmark analogous to the ANDV benchmark\. The Delta dataset therefore contributes exclusively to the graph\-level uncertainty analysis of Section[3\.11](https://arxiv.org/html/2606.30842#S3.SS11), not to temporal prior evaluation\.
Table 12: Guangdong Delta transmission graph: extraction summary.

| Extracted quantity | Value |
| --- | --- |
| Total visualization nodes | 131 |
| Human case nodes | 122 |
| Context or location nodes | 9 |
| Total directed edges | 142 |
| High-confidence directed edges | 107 |
| Uncertain directed edges | 35 |
| High-confidence case-to-case directed edges | 67 |
| Strict unique-parent case-to-case edges | 57 |
| Child cases with multiple high-confidence possible parents | 5 |
### 4\.12Transmission\-Label Uncertainty Altered Inferred Source Structure
Retention of uncertain edges under the uncertainty\-expansion framework \(Section[3\.11](https://arxiv.org/html/2606.30842#S3.SS11)\) changed inferred source structure in both datasets, as shown in Table[13](https://arxiv.org/html/2606.30842#S4.T13)and Figure[6](https://arxiv.org/html/2606.30842#S3.F6)\.
For ANDV Epuyén, the strict graph contained 27 edges; the full plausible graph contained 40 edges. The expansion added 13 possible offspring links and caused 9 source cases to gain at least one possible offspring. The top-5 source-set Jaccard similarity between strict and full plausible reconstructions was 0.667. Source-rank agreement was moderate (Spearman $\rho=0.569$, Kendall $\tau=0.520$). The offspring Gini coefficient decreased from 0.691 to 0.453, indicating reduced concentration of inferred transmission burden.
For Guangdong Delta, the strict unique-parent graph contained 57 case-to-case edges; the high-confidence plus uncertain graph contained 72 edges. The expansion added 15 possible offspring links and caused 11 parent nodes to gain at least one possible offspring. The top-5 source-set Jaccard similarity was 0.429. Source-rank agreement was moderate (Spearman $\rho=0.679$, Kendall $\tau=0.631$). The offspring Gini coefficient decreased from 0.500 to 0.374. The lower Jaccard value for Delta relative to ANDV may reflect greater absolute structural ambiguity in the Delta network, consistent with its larger scale.
Table 13: Graph-level uncertainty alters inferred source structure. Statistics are reported for the strict and uncertainty-expanded edge sets. Jaccard similarity below 1 indicates changes in the top-5 source set, whereas the Gini shift reflects changes in offspring-burden concentration.

| Dataset | Strict Graph | Expanded Graph | Edge Gain | Parents Gaining | Top-5 Jaccard | Spearman $\rho$ | Kendall $\tau$ | Gini Shift |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ANDV Epuyén | 27 | 40 | +13 | 9 | 0.667 | 0.569 | 0.520 | $0.691 \rightarrow 0.453$ |
| Guangdong Delta | 57 | 72 | +15 | 11 | 0.429 | 0.679 | 0.631 | $0.500 \rightarrow 0.374$ |
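The two structural summaries in Table 13 can be illustrated with a short sketch. The offspring counts are hypothetical, ties in the top-$k$ ranking are broken by source identifier (an assumption), and the Gini coefficient uses the standard sorted-rank formula, which may differ in detail from the definition in Section 3.11.

```python
def top_k_sources(offspring, k=5):
    """Top-k sources by offspring count; ties broken by identifier (assumption)."""
    ranked = sorted(offspring, key=lambda s: (-offspring[s], s))
    return set(ranked[:k])

def jaccard(a, b):
    """|A intersect B| / |A union B|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def gini(counts):
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

# Hypothetical offspring counts per source under the two edge sets.
strict = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1, "F": 0}
expanded = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1, "F": 3}
top5_jaccard = jaccard(top_k_sources(strict), top_k_sources(expanded))
gini_shift = (gini(strict.values()), gini(expanded.values()))
```

In this toy case one source enters the top-5 only after uncertain links are retained, so the two sets share four of six distinct members and the Jaccard similarity is $4/6\approx 0.667$. By the same arithmetic, two swapped members give $3/7\approx 0.429$.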
### 4\.13Transmission\-Label Uncertainty Changed Fixed\-Capacity Prioritization Conclusions
The decision\-instability analysis \(Section[3\.11](https://arxiv.org/html/2606.30842#S3.SS11), Equation[5](https://arxiv.org/html/2606.30842#S3.E5)\) evaluated whether uncertainty changes the composition of the top\-5 source\-case investigation set under a fixed\-capacity prioritization scenario\. Results are summarized in Table[14](https://arxiv.org/html/2606.30842#S4.T14) and illustrated in Figure[7](https://arxiv.org/html/2606.30842#S3.F7)\.
For ANDV Epuyén, the strict\-only and uncertainty\-aware top\-5 source sets had Jaccard similarity 0\.667\. Case P22 was absent from the strict\-only top\-5 but present in the uncertainty\-aware top\-5\. Strict\-only prioritization captured 92\.86% of the uncertainty\-aware possible top\-5 offspring burden, yielding a decision\-regret fraction of 0\.0714 \(Equation[5](https://arxiv.org/html/2606.30842#S3.E5)\)\.
For Guangdong Delta, the priority\-set Jaccard was 0\.429\. Source nodes 5646 and 5647 were absent from the strict\-only top\-5 but present in the uncertainty\-aware top\-5\. Strict\-only prioritization captured 96\.43% of the uncertainty\-aware possible burden, yielding a decision\-regret fraction of 0\.0357\.
The absolute regret fractions are moderate; these results should not be interpreted as evidence of universal disruption to outbreak response\. The operationally relevant finding is that the identity of the investigation\-priority set changed: specific source cases selected under uncertainty\-aware reconstruction would not be identified under strict\-only reconstruction\.
Table 14: Fixed\-capacity source\-prioritization instability under transmission\-label uncertainty\. Decision regret \(Eq\.[5](https://arxiv.org/html/2606.30842#S3.E5)\) denotes the fraction of uncertainty\-aware top\-5 offspring burden missed by strict\-only prioritization\.

| Dataset / Scenario | Capacity | Priority Jaccard | Missed Priority Cases | Burden Coverage | Decision Regret |
| --- | --- | --- | --- | --- | --- |
| ANDV Epuyén: strict vs\. full plausible | Top\-5 | 0\.667 | P22 | 0\.9286 | 0\.0714 |
| Guangdong Delta: strict vs\. high\-plus\-unsure | Top\-5 | 0\.429 | 5646; 5647 | 0\.9643 | 0\.0357 |

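Burden coverage and decision regret, as described in the Table 14 caption, can be sketched as follows\. This is an illustration under stated assumptions, not a transcription of Equation 5: the function names are hypothetical, burden is taken to be the sum of possible offspring counts in the uncertainty\-aware graph, and the counts are toy values chosen so that coverage equals 13/14\.

```python
def burden(sources, counts):
    """Total possible offspring attributed to a set of source cases."""
    return sum(counts.get(source, 0) for source in sources)


def decision_regret(strict_top, aware_top, aware_counts):
    """Fraction of the uncertainty-aware top-k burden missed by the strict set."""
    reference = burden(aware_top, aware_counts)
    if reference == 0:
        return 0.0
    return 1.0 - burden(strict_top, aware_counts) / reference


# Toy offspring counts in the uncertainty-aware graph (hypothetical).
aware_counts = {"A": 4, "B": 3, "C": 3, "D": 2, "E": 2, "F": 1}
aware_top5 = {"A", "B", "C", "D", "E"}   # burden 14
strict_top5 = {"A", "B", "C", "D", "F"}  # burden 13 under the same counts

coverage = burden(strict_top5, aware_counts) / burden(aware_top5, aware_counts)
regret = decision_regret(strict_top5, aware_top5, aware_counts)
```

The reported coverage values are numerically consistent with small integer ratios \(13/14 rounds to 0\.9286 and 27/28 to 0\.9643\), although the underlying per\-case counts are not reproduced here\.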
### 4\.14Uncertainty Elevated Source Cases Across High\-Transmission Thresholds
Applying an operational threshold of at least three plausible offspring to designate high\-priority source cases, the strict ANDV Epuyén graph identified 3 such cases while the uncertainty\-aware graph identified 5\. Cases P10 and P22 crossed the threshold only when uncertain links were retained \(Table[15](https://arxiv.org/html/2606.30842#S4.T15)\)\.
For Guangdong Delta, the strict graph identified 7 high\-priority source nodes and the high\-plus\-unsure graph identified 8\. Node 5646 was newly elevated under uncertainty\-aware reconstruction\. These results confirm that label uncertainty affects not only continuous ranking summaries but also discrete threshold\-based classifications that govern investigation\-triage logic\.
Node 5647 entered the uncertainty\-aware top\-5 priority set by relative source ranking but did not newly cross the absolute high\-transmission threshold of at least three offspring\. Accordingly, 5646 and 5647 are both counted as missed uncertainty\-aware top\-5 priority cases, whereas only 5646 is counted as a newly elevated high\-transmission source\.
Table 15: High\-transmission source classification shifts under uncertainty expansion\. The threshold of at least three offspring was applied consistently across both datasets and edge\-set conditions\.

| Dataset / Scenario | Threshold | Strict | Uncertainty\-Aware | Newly Elevated Source IDs |
| --- | --- | --- | --- | --- |
| ANDV Epuyén: strict vs\. full plausible | $\geq 3$ | 3 | 5 | P10; P22 |
| Guangdong Delta: strict vs\. high\-plus\-unsure | $\geq 3$ | 7 | 8 | 5646 |

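The distinction between entering the top\-5 by relative rank and crossing the absolute threshold can be made concrete with a short sketch\. The helper names and the toy edges below are hypothetical; only the threshold of at least three offspring comes from the text\.

```python
from collections import Counter


def high_priority(edges, threshold=3):
    """Sources with at least `threshold` offspring in a (parent, child) edge list."""
    counts = Counter(parent for parent, _child in edges)
    return {source for source, count in counts.items() if count >= threshold}


def newly_elevated(strict_edges, expanded_edges, threshold=3):
    """Sources that cross the threshold only when uncertain edges are retained."""
    return high_priority(expanded_edges, threshold) - high_priority(strict_edges, threshold)


# Toy edges (hypothetical): S1 has 3 strict offspring, S2 has 2, S3 has 1.
strict_edges = [("S1", "a"), ("S1", "b"), ("S1", "c"),
                ("S2", "d"), ("S2", "e"), ("S3", "f")]
# Uncertain links add one possible offspring each to S2 and S3.
expanded_edges = strict_edges + [("S2", "g"), ("S3", "h")]

elevated = newly_elevated(strict_edges, expanded_edges)
```

In this toy case S2 crosses the threshold while S3 gains a possible offspring without crossing it, which mirrors how a node can rise in relative ranking without being counted as newly elevated\.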
### 4\.15Positioning Relative to Genome\-Integrated Reconstruction Methods
The four fair parametric baselines \(Gaussian, KDE, Gamma, Lognormal\) and the Epuyén serial\-interval reference operate in the same input regime as the present method: timing data only, producing candidate rankings\. They are therefore appropriate numerical comparators \(Section[3\.4](https://arxiv.org/html/2606.30842#S3.SS4)\)\. Methods including outbreaker2\[[2](https://arxiv.org/html/2606.30842#bib.bib16)\], SCOTTI\[[7](https://arxiv.org/html/2606.30842#bib.bib8)\], and epi\-genomic integration frameworks\[[3](https://arxiv.org/html/2606.30842#bib.bib31)\] address structurally different inference problems: they require per\-case genome sequences, intrahost variant data, or joint phylodynamic\-epidemiological likelihoods\. Direct numerical comparison on the strict ANDV timing\-only benchmark would be methodologically inappropriate because the input data regimes do not overlap\. These methods are positioned as related work addressing adjacent regimes \(Section[2](https://arxiv.org/html/2606.30842#S2)\), not as numerical competitors\.
### 4\.16Summary of Results
Four conclusions follow from the analyses above\. First, the locked learned prior generalizes across disease families in LODO evaluation on D1, achieving disease\-macro MRR of 0\.57495 versus 0\.45225 for the source\-trained Gaussian likelihood baseline \($\Delta\mathrm{MRR}=+0.12270$; bootstrap 95% CI $[+0.02978, +0.25395]$\), with a stable prior shape \(mean Pearson $r=0.9917$ across 55 pairwise fold comparisons\)\. Second, the locked prior substantially outperforms all four fair source\-trained temporal baselines on the primary real ANDV benchmark: MRR 0\.571 versus 0\.274 for the best fair baseline, all permutation $p$ values at or below 0\.000200, maximum leave\-one\-task\-out MRR shift 0\.0187, and 7 to 8 task reversals required to lose sign\-test significance\. Source\-domain diagnostics further showed that the ANDV result was not attenuated by removing the orthopox\-coded D1 group and remained above both Gaussian and random\-ranking baselines even after removing the most influential source group, MERS\.
Third, real transmission evidence is not structurally clean: 54\.67% of MPXV epidemiologically linked inter\-host pairs are unresolved or not genomically supported, and the Guangdong Delta graph contains 35 uncertain directed edges and 5 cases with multi\-parent ambiguity\. Fourth, retaining uncertain links changes inferred source burden, top\-source set composition, offspring concentration, and threshold\-based high\-priority classifications in both datasets examined\. Together, these results support the position that uncertainty\-aware outbreak reconstruction changes which source cases are identified as priorities, and that treating transmission labels as clean deterministic ground truth excludes information associated with operationally verifiable differences in reconstruction conclusions\.
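The 54\.67% figure corresponds to 41 of the 75 audited MPXV pairs \(41/75 = 0\.5467\)\. An exact binomial interval for such a proportion can be sketched with the standard library alone\. The code below assumes that the "exact" interval reported elsewhere in the paper is a Clopper\-Pearson interval, which the text does not state explicitly; the function names are hypothetical\.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_of_decreasing(f, iters=100):
    """Bisection on [0, 1] for a function that is decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _root_of_decreasing(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _root_of_decreasing(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper


ci_low, ci_high = clopper_pearson(41, 75)
```

If the assumption holds, `ci_low` and `ci_high` should agree with the reported 42\.75–66\.21% interval up to rounding\.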
## 5Discussion
### 5\.1Integrated Meaning of the Findings
The study addresses two related but separable questions about outbreak transmission reconstruction\. The first is methodological: whether a temporal prior learned from multi\-disease data and locked before target evaluation can transfer to real outbreak parent\-ranking\. The second is evidentiary: whether real outbreak transmission labels can be treated as clean deterministic ground truth, and whether the answer affects practical reconstruction conclusions\.
On the methodological question, the strict ANDV benchmark provides the primary evidence \(Sections[4\.5](https://arxiv.org/html/2606.30842#S4.SS5) and [4\.7](https://arxiv.org/html/2606.30842#S4.SS7), Tables[8](https://arxiv.org/html/2606.30842#S4.T8) through [10](https://arxiv.org/html/2606.30842#S4.T10) and Figure[3](https://arxiv.org/html/2606.30842#S3.F3)\)\. The locked learned prior improved MRR from 0\.274 to 0\.571 relative to the Gaussian baseline and improved the median true\-parent rank from position 7 to position 2, with all permutation $p$ values at or below 0\.000200\. The advantage held across all four finite\-sample robustness diagnostics, including a maximum leave\-one\-task\-out MRR shift of 0\.0187 and 7 to 8 task reversals required to lose sign\-test significance\. The D1 LODO analysis \(Section[4\.2](https://arxiv.org/html/2606.30842#S4.SS2)\) provides the generalization evidence: the prior achieved disease\-macro MRR of 0\.57495 versus 0\.45225 for the source\-trained Gaussian likelihood baseline \($\Delta\mathrm{MRR}=+0.12270$; bootstrap 95% CI $[+0.02978, +0.25395]$\), with a near\-invariant prior shape \(mean Pearson $r=0.9917$ across 55 fold comparisons\), confirming that the learned structure reflects a stable property of transmission timing rather than disease\-specific interval patterns\. The SVD pilot evaluation \(Section[4\.8](https://arxiv.org/html/2606.30842#S4.SS8)\) provides relative\-time ranking feasibility evidence that temporal gap proximity carries discriminative signal in the SVD network; it does not constitute a second validation of the locked learned prior, because the evaluation used temporal heuristics rather than the serialized locked\-prior artifact under the zero\-shot transfer protocol\.
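For readers less familiar with the ranking metrics quoted above, the following sketch shows how MRR, Top\-1 accuracy, and median true\-parent rank are computed for a set of parent\-ranking tasks\. It is illustrative only: the task format, the function names, and the pessimistic tie handling \(tied candidates rank ahead of the true parent\) are assumptions, and the scores are toy values\.

```python
from statistics import median


def true_parent_rank(scores, true_parent):
    """1-based rank of the true parent when candidates are sorted by score."""
    target = scores[true_parent]
    # Pessimistic ties (an assumption): equal-scoring rivals rank ahead.
    return 1 + sum(1 for cand, s in scores.items()
                   if cand != true_parent and s >= target)


def ranking_metrics(tasks):
    """MRR, Top-1 accuracy, and median rank over (scores, true_parent) tasks."""
    ranks = [true_parent_rank(scores, parent) for scores, parent in tasks]
    return {
        "mrr": sum(1.0 / r for r in ranks) / len(ranks),
        "top1": sum(r == 1 for r in ranks) / len(ranks),
        "median_rank": median(ranks),
    }


# Toy tasks (hypothetical plausibility scores per candidate parent).
tasks = [
    ({"a": 0.9, "b": 0.2, "c": 0.1}, "a"),            # true parent ranked 1st
    ({"a": 0.5, "b": 0.7, "c": 0.1}, "a"),            # ranked 2nd
    ({"a": 0.1, "b": 0.7, "c": 0.5, "d": 0.3}, "a"),  # ranked 4th
]
metrics = ranking_metrics(tasks)
```

Because reciprocal rank saturates at 1 and decays quickly, MRR rewards placing the true parent at or near the top, whereas median rank summarizes typical position without that emphasis\.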
On the evidentiary question, the MPXV and Guangdong Delta analyses \(Sections[4\.10](https://arxiv.org/html/2606.30842#S4.SS10) through [4\.14](https://arxiv.org/html/2606.30842#S4.SS14)\) provide evidence independent of the temporal prior\. The MPXV label audit found that 54\.67% of epidemiologically linked inter\-host pairs were unresolved or not genomically supported \(Table[11](https://arxiv.org/html/2606.30842#S4.T11), Figure[5](https://arxiv.org/html/2606.30842#S3.F5)\); the implication is not that such links are uninformative, but that they do not represent confirmed direct\-transmission ground truth\. The Guangdong Delta graph contained 35 uncertain directed edges, 5 cases with multiple high\-confidence possible parents, and no recoverable onset\-date field suitable for temporal ranking \(Table[12](https://arxiv.org/html/2606.30842#S4.T12)\)\. Retaining uncertain links reduced the top\-5 source\-set Jaccard similarity to 0\.667 for ANDV and 0\.429 for Delta, shifted offspring Gini coefficients substantially in both datasets, and caused specific source cases to cross the high\-priority threshold only under uncertainty\-aware reconstruction \(Tables[13](https://arxiv.org/html/2606.30842#S4.T13) through [15](https://arxiv.org/html/2606.30842#S4.T15), Figures[6](https://arxiv.org/html/2606.30842#S3.F6) and [7](https://arxiv.org/html/2606.30842#S3.F7)\)\. The decision\-regret fractions were moderate \(0\.0714 for ANDV, 0\.0357 for Delta; Equation[5](https://arxiv.org/html/2606.30842#S3.E5), Table[14](https://arxiv.org/html/2606.30842#S4.T14)\); these results do not prove that operational response decisions would change in all settings\. The finding of primary importance is a shift in priority\-set identity: specific source cases selected under uncertainty\-aware reconstruction would not be identified under strict\-only reconstruction\.
The primary conclusion is therefore not that uncertainty prevents inference; it is that deterministic treatment of uncertain labels excludes information whose retention is associated with operationally verifiable differences in the inferred priority set\. The decision\-instability analysis therefore demonstrates sensitivity of priority\-set identity to retained edge uncertainty, not improvement in the priority set or prospective public\-health benefit\.
### 5\.2Regime Mapping Relative to Contemporary Reconstruction Methods
The present framework is complementary to, not competitive with, genome\-integrated outbreak reconstruction methods \(Section[2](https://arxiv.org/html/2606.30842#S2), Table[1](https://arxiv.org/html/2606.30842#S2.T1)\)\. When per\-case whole\-genome sequences are available, within\-host diversity is sufficient to discriminate transmission pairs, and time permits sequencing and analysis, methods such as outbreaker2\[[2](https://arxiv.org/html/2606.30842#bib.bib16)\], SCOTTI\[[7](https://arxiv.org/html/2606.30842#bib.bib8)\], and epi\-genomic integration frameworks\[[8](https://arxiv.org/html/2606.30842#bib.bib15),[3](https://arxiv.org/html/2606.30842#bib.bib31)\] are the appropriate tools for outbreak\-level transmission\-tree or posterior reconstruction\. These methods address richer inference problems than the timing\-only candidate\-parent ranking task studied here: they incorporate sequence evolution, phylogenetic uncertainty, incomplete sampling, and missing intermediate transmissions\. Forcing them into a fair numerical comparison on the strict ANDV timing\-only benchmark would be methodologically inappropriate because the data inputs do not overlap\.
The present study isolates a complementary regime: early outbreak response in which genomic data are absent, delayed, inconclusive due to low within\-pathogen diversity \(a documented characteristic of ANDV\), or not yet linkable to resolved exposure or onset timing\. In this regime, a locked temporal prior provides a source\-trained ranking signal applicable immediately, without target\-specific refitting\. The MPXV and Delta results further demonstrate that even when epidemiological or genomic evidence exists, derived labels may remain uncertain\. The more productive framing is therefore regime\-specific integration: timing\-only priors are the appropriate primary signal when timing evidence is the most reliable available input; genome\-integrated models are appropriate when sequencing data support transmission inference; and uncertainty\-aware graph analysis is warranted in both regimes, because real outbreak labels are structurally imperfect across evidence types\.
### 5\.3Limitations
The primary validation treats ANDV strict edges as ground truth while the MPXV module demonstrates that epidemiological links cannot be assumed to represent confirmed direct transmission\. These claims are in tension, and the distinction requires explicit justification\. The Martínez et al\.\[[17](https://arxiv.org/html/2606.30842#bib.bib18)\] ANDV investigation differs structurally from the retrospective contact\-interview linkage audited in the MPXV module in three respects\. First, it was a prospective epidemiological investigation designed specifically to characterize person\-to\-person transmission chains, with sequential generation structure independently documented across multiple household clusters\. Second, the 2014 Argentina cluster underlying three of the 29 benchmark tasks was confirmed by full\-length genomic sequencing, providing multi\-modal corroboration of transmission direction\. Third, the strict edge\-inclusion criterion applied in Section[3\.5\.2](https://arxiv.org/html/2606.30842#S3.SS5.SSS2) excluded all edges marked as uncertain, possible, or alternative, retaining only those with affirmative high\-confidence classification and no competing documented parent: a criterion substantially more stringent than the epidemiological linkage criteria underlying the MPXV pairs\. Nonetheless, the ANDV strict edges cannot be certified entirely free of label error\. The appropriate interpretation is that any residual noise in the ground truth would attenuate the measured MRR advantage, making MRR = 0\.571 a conservative lower bound rather than an inflated estimate\. This conservative\-bound interpretation holds under the assumption that residual label errors in the strict ANDV benchmark are non\-differential with respect to temporal gap: longer\-gap pairs are not systematically more likely to carry undetected label error than shorter\-gap pairs, which are also the pairs where the learned prior assigns maximum plausibility\.
This assumption cannot be directly verified from the available data\. The MPXV evidence is not a claim about ANDV label quality specifically; it is a demonstration that epidemiological linkage across pathogen families and investigation designs carries unquantified uncertainty that reconstruction frameworks should treat explicitly\.
Primary external validation of the learned prior is concentrated in the strict ANDV benchmark, which comprises 29 ranked child\-reconstruction tasks \(Section[3\.5](https://arxiv.org/html/2606.30842#S3.SS5)\)\. This is a consequence of the structural scarcity of documented directional person\-to\-person ANDV transmission events with verified onset dates in the published literature\. The Global\.Health repository audit \(Section[4\.3](https://arxiv.org/html/2606.30842#S4.SS3)\) confirmed that this scarcity is not specific to ANDV: 134,095 records yielded 26 recoverable transmission edges and zero usable parent\-ranking tasks under strict construction criteria\. The finite\-sample robustness analyses \(Section[4\.7](https://arxiv.org/html/2606.30842#S4.SS7), Table[10](https://arxiv.org/html/2606.30842#S4.T10)\) address the compact benchmark size directly: the maximum leave\-one\-task\-out MRR shift was 0\.0187, and 7 to 8 task reversals were required to lose sign\-test significance\. These diagnostics confirm that the observed advantage is not driven by individual influential tasks, but they do not substitute for a larger benchmark\. Future work should evaluate the locked prior on additional real\-outbreak parent\-ranking benchmarks as directional onset\-date data become available\. Consistent with the regime\-specific advantage structure of the learned prior \(Section[4\.2](https://arxiv.org/html/2606.30842#S4.SS2)\), a secondary evaluation on the Nipah Faridpur 2004 outbreak data reported by Nikolay et al\.\[[18](https://arxiv.org/html/2606.30842#bib.bib32)\] \(median inter\-case gap: 13 days\) showed that the learned prior was competitive with, but did not consistently outperform, the Gaussian baseline, consistent with expectations for a short\-gap transmission setting outside the principal advantage window of the learned prior\.
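The leave\-one\-task\-out diagnostic cited in this paragraph can be sketched in a few lines\. Removing task $i$ from $n$ tasks changes the mean reciprocal rank by exactly $(\mathrm{MRR}-\mathrm{RR}_i)/(n-1)$, so the maximum shift is governed by the task whose reciprocal rank lies farthest from the mean\. The function names and the toy reciprocal ranks below are hypothetical, and the paper's exact definition of the shift may differ from this reading\.

```python
def loto_mrr_shifts(reciprocal_ranks):
    """Change in MRR caused by removing each task in turn."""
    n = len(reciprocal_ranks)
    if n < 2:
        raise ValueError("need at least two tasks for leave-one-out")
    total = sum(reciprocal_ranks)
    full_mrr = total / n
    return [(total - rr) / (n - 1) - full_mrr for rr in reciprocal_ranks]


def max_abs_loto_shift(reciprocal_ranks):
    """Largest absolute leave-one-task-out MRR shift."""
    return max(abs(shift) for shift in loto_mrr_shifts(reciprocal_ranks))


# Toy reciprocal ranks for four tasks (hypothetical, not the ANDV tasks).
toy_rr = [1.0, 0.5, 0.25, 1.0]
largest_shift = max_abs_loto_shift(toy_rr)
```

With only four toy tasks the shift is large; with 29 tasks the divisor is 28, which bounds how far any single task can move the mean\.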
The SVD pilot evaluation \(Section[4\.8](https://arxiv.org/html/2606.30842#S4.SS8)\) contributes relative\-time ranking feasibility evidence but does not constitute a validation of the locked learned prior\. Three distinct limitations apply: absolute onset dates were not uniformly recoverable, so the benchmark was constructed from relative temporal ordering rather than verified calendar dates; the evaluation applied temporal proximity heuristics rather than the serialized locked\-prior artifact under the zero\-shot protocol of Section[3\.3\.3](https://arxiv.org/html/2606.30842#S3.SS3.SSS3); and sample size was insufficient for the full robustness test battery applied to the ANDV benchmark\. These results are reported as exploratory feasibility evidence\. Establishing whether the locked learned prior generalizes to SVD requires a dedicated benchmark constructed from absolute onset dates and evaluated using the serialized locked\-prior artifact\.
The MPXV and Guangdong Delta modules contribute to the label\-uncertainty argument but do not validate the learned prior \(Sections[4\.10](https://arxiv.org/html/2606.30842#S4.SS10) and [4\.11](https://arxiv.org/html/2606.30842#S4.SS11)\)\. The MPXV module depends on published phylogenetic concordance categories from a single study\[[1](https://arxiv.org/html/2606.30842#bib.bib11)\]; the codebook was applied deterministically, with each source category mapping to exactly one of the four codebook classes without subjective judgment\. The aggregate classification counts are reproducible from the deterministic codebook and published source labels; pair\-level classification files are retained for journal submission and peer\-review reproducibility\.
The decision\-instability analysis \(Section[4\.13](https://arxiv.org/html/2606.30842#S4.SS13), Equation[5](https://arxiv.org/html/2606.30842#S3.E5)\) models a fixed\-capacity source\-prioritization scenario and does not constitute a prospective intervention study\. Whether the observed priority\-set changes would alter outbreak\-control outcomes depends on factors outside the scope of the present analysis: investigation capacity, source\-case network position, confirmatory data availability, and outbreak stage\. The decision\-regret fractions reported \(0\.0714 for ANDV, 0\.0357 for Delta\) are lower\-bound quantifications of prioritization instability, not estimates of expected operational impact\.
A related question is whether the learned prior’s advantage on ANDV reflects genuinely transferable temporal structure or incidental alignment with particular source\-domain timing regimes in D1\. Source\-domain influence diagnostics addressed this concern directly\. Removing the orthopox\-coded/Smallpox group from D1 did not attenuate ANDV performance: MRR increased from 0\.5707 under the recovered full\-D1 protocol to 0\.5951 after orthopox\-coded group removal\. Because the paired confidence interval for this change included zero, this result should be interpreted as robustness evidence rather than as proof that orthopox removal improves performance\. The strongest source\-domain influence was instead observed for MERS: removing MERS reduced ANDV MRR to 0\.4620\. This attenuation indicates that the learned temporal prior partly benefits from source domains with timing structure relevant to longer\-gap transmission regimes\. However, the MERS\-removed model remained above both the random\-ranking expectation and the fair Gaussian baseline, and removing MERS together with the orthopox\-coded group produced the same MRR = 0\.4620\. The appropriate interpretation is therefore not that performance is independent of all source\-domain composition, but that the reported ANDV advantage is not explained by orthopox\-coded exposure or by candidate\-set geometry alone\.
The strict edge\-selection criteria in Algorithm[1](https://arxiv.org/html/2606.30842#algorithm1) intentionally retain only the most reliable documented transmission events, requiring verified onset dates for both parent and child, unique\-parent assignment, and high\-confidence epidemiological classification\. The reported $\mathrm{MRR}=0.571$ therefore estimates performance on the best\-characterized subset of ANDV transmission events, not on the complete spectrum of transmission\-reconstruction tasks encountered in real outbreak investigations, which also includes ambiguous multi\-parent cases, incomplete temporal records, and lower\-confidence epidemiological links\. The leave\-one\-task\-out sensitivity analysis \(maximum MRR change: 0\.0187\) demonstrates that no individual task dominates reported performance; however, this analysis evaluates benchmark stability, not the selection bias introduced by the inclusion criteria used to construct the benchmark\. Performance on the unrestricted transmission\-reconstruction problem may therefore be lower than the estimates reported here\.
The MPXV transmission\-label uncertainty analysis is based on a single published investigation of one urban outbreak characterized by a dense sexual\-contact network with substantial partner anonymity and phylogenetic resolution constrained by the relatively low mutation rate of MPXV\. Both characteristics may increase label uncertainty beyond levels expected for outbreak datasets more generally\. Rural household\-contact outbreaks with sequential generational transmission, including ANDV, may yield substantially more reliable epidemiological links\. The Guangdong Delta analysis provides an independent observation in a different pathogen and epidemiological setting; however, the broader claim that transmission\-label uncertainty is a structural property of outbreak data rests on cross\-pathogen inductive reasoning rather than evidence from a representative sample of outbreak scenarios\. The MPXV analysis demonstrates the existence of measurable transmission\-label uncertainty in at least one well\-characterized real\-world outbreak; its prevalence across diverse pathogen families, transmission settings, and investigation designs remains an open empirical question\.
## 6Conclusion
A logistic regression temporal transmission prior trained on eleven disease families under leave\-one\-disease\-out cross\-validation achieved a near\-invariant plausibility curve shape across all 55 pairwise fold comparisons \(mean Pearson $r=0.9917$\), demonstrating that the learned structure reflects a stable regularity in transmission timing, not disease\-specific interval patterns\. Applied without refitting to the strict ANDV parent\-ranking benchmark, the locked prior improved MRR from 0\.274 to 0\.571 relative to the best fair source\-trained parametric baseline, with all permutation $p$ values at or below 0\.000200 and a maximum leave\-one\-task\-out MRR shift of 0\.0187; the advantage is therefore not attributable to individual influential tasks\. Source\-domain sensitivity analyses further showed that this result was not attenuated by removing the orthopox\-coded D1 group and remained above Gaussian and random\-ranking baselines even after removing the most influential source group, MERS\. A pilot evaluation on a reconstructed Sudan virus disease transmission network confirmed that temporal gap proximity carries discriminative ranking signal under relative\-time benchmark conditions; validating the locked prior on SVD requires a dedicated absolute\-onset benchmark\. A systematic label audit of 75 NYC MPXV epidemiologically linked inter\-host pairs found that 54\.67% \(exact 95% CI: 42\.75–66\.21%\) were unresolved or not genomically supported, illustrating that epidemiological linkage should not automatically be treated as confirmed direct\-transmission ground truth\.
Retaining uncertain edges in both the ANDV and Guangdong Delta transmission graphs changed inferred offspring burden, source offspring\-count rankings, and offspring\-concentration Gini coefficients, with top\-5 priority\-set Jaccard similarities of 0\.667 and 0\.429 respectively \(Equation[5](https://arxiv.org/html/2606.30842#S3.E5), Tables[13](https://arxiv.org/html/2606.30842#S4.T13) and [14](https://arxiv.org/html/2606.30842#S4.T14)\)\. Together, these findings argue that outbreak transmission reconstruction should move toward uncertainty\-aware frameworks: not because uncertainty is unmanageable, but because ignoring it changes the answer\.
## References
- \[1\]S\. Akther, M\. Su, J\. C\. Wang, H\. Amin, F\. Taki, N\. De La Cruz, M\. Chowdhury, T\. Clabby, E\. Kopping, V\. E\. Ruiz, M\. Leelawong, J\. Latash, K\. Johnson, J\. Baumgartner, M\. Wong, A\. Olsen, R\. C\. Fowler, J\. E\. Pekar, J\. L\. Havens, T\. I\. Vasylyeva, J\. O\. Wertheim, S\. Hughes, and E\. Omoregie\(2025\)Genomic epidemiology of mpox virus during the 2022 outbreak in new york city\.Nature Communications16,pp\. 8354\.External Links:[Document](https://dx.doi.org/10.1038/s41467-025-60486-x)Cited by:[§1](https://arxiv.org/html/2606.30842#S1.p1.1),[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px3.p1.1),[§3\.12](https://arxiv.org/html/2606.30842#S3.SS12.p1.1),[§3\.9\.1](https://arxiv.org/html/2606.30842#S3.SS9.SSS1.p1.1),[§5\.3](https://arxiv.org/html/2606.30842#S5.SS3.p4.1)\.
- \[2\]\(2018\)Outbreaker2: a modular platform for outbreak reconstruction\.BMC bioinformatics19\(Suppl 11\),pp\. 363\.Cited by:[§1](https://arxiv.org/html/2606.30842#S1.p2.1),[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.8.5.5.1.1),[§4\.15](https://arxiv.org/html/2606.30842#S4.SS15.p1.1),[§5\.2](https://arxiv.org/html/2606.30842#S5.SS2.p1.1)\.
- \[3\]J\. Carson, M\. Keeling, P\. Ribeca, and X\. Didelot\(2025\)Incorporating epidemiological data into the genomic analysis of partially sampled infectious disease outbreaks\.Molecular Biology and Evolution42\(4\),pp\. msaf083\.External Links:[Document](https://dx.doi.org/10.1093/molbev/msaf083)Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.11.8.5.1.1),[§4\.15](https://arxiv.org/html/2606.30842#S4.SS15.p1.1),[§5\.2](https://arxiv.org/html/2606.30842#S5.SS2.p1.1)\.
- \[4\]C\. Colijn, M\. Hall, and R\. Bouckaert\(2024\-07\)Taking a BREATH \(Bayesian reconstruction and evolutionary analysis of transmission histories\) to simultaneously infer phylogenetic and transmission trees for partially sampled outbreaks\.bioRxiv\.Note:PreprintExternal Links:[Document](https://dx.doi.org/10.1101/2024.07.11.603095),[Link](https://doi.org/10.1101/2024.07.11.603095)Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.13.10.5.1.1)\.
- \[5\]A\. Cori, N\. M\. Ferguson, C\. Fraser, and S\. Cauchemez\(2013\)A new framework and software to estimate time\-varying reproduction numbers during epidemics\.American Journal of Epidemiology178\(9\),pp\. 1505–1512\.External Links:[Document](https://dx.doi.org/10.1093/aje/kwt133)Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.4.1.5.1.1)\.
- \[6\]N\. De Maio, C\. J\. Worby, D\. J\. Wilson, and N\. Stoesser\(2018\)Bayesian reconstruction of transmission within outbreaks using genomic variants\.PLoS computational biology14\(4\),pp\. e1006117\.Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.11.8.5.1.1)\.
- \[7\]N\. De Maio, C\. Wu, and D\. J\. Wilson\(2016\)SCOTTI: efficient reconstruction of transmission within outbreaks with the structured coalescent\.PLoS computational biology12\(9\),pp\. e1005130\.Cited by:[§1](https://arxiv.org/html/2606.30842#S1.p2.1),[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.9.6.5.1.1),[§4\.15](https://arxiv.org/html/2606.30842#S4.SS15.p1.1),[§5\.2](https://arxiv.org/html/2606.30842#S5.SS2.p1.1)\.
- \[8\]X\. Didelot, C\. Fraser, J\. Gardy, and C\. Colijn\(2017\)Genomic infectious disease epidemiology in partially sampled and ongoing outbreaks\.Molecular biology and evolution34\(4\),pp\. 997–1007\.Cited by:[§1](https://arxiv.org/html/2606.30842#S1.p2.1),[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.10.7.5.1.1),[§5\.2](https://arxiv.org/html/2606.30842#S5.SS2.p1.1)\.
- \[9\]X\. Didelot, J\. Gardy, and C\. Colijn\(2014\)Bayesian inference of infectious disease transmission from whole\-genome sequence data\.Molecular biology and evolution31\(7\),pp\. 1869–1879\.Cited by:[§1](https://arxiv.org/html/2606.30842#S1.p2.1),[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.8.5.5.1.1)\.
- \[10\]T\. Ganyani, C\. Kremer, D\. Chen, A\. Torneri, C\. Faes, J\. Wallinga, and N\. Hens\(2020\)Estimating the generation interval for coronavirus disease \(covid\-19\) based on symptom onset data, march 2020\.Eurosurveillance25\(17\),pp\. 2000257\.Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.5.2.5.1.1)\.
- \[11\]S\. Hadjisotiriou, V\. Marchau, W\. Walker, and M\. O\. Rikkert\(2023\)Decision making under deep uncertainty for pandemic policy planning\.Health policy133,pp\. 104831\.Cited by:[§1](https://arxiv.org/html/2606.30842#S1.p2.1),[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px4.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.1.5.1.1)\.
- \[12\]Z\. Kabami, A\. R\. Ario, J\. R\. Harris, M\. Ninsiima, S\. R\. Ahirirwe, J\. R\. A\. Ocero, D\. Atwine, H\. G\. Mwebesa, D\. J\. Kyabayinze, A\. N\. Muruta,et al\.\(2024\)Ebola disease outbreak caused by the sudan virus in uganda, 2022: a descriptive epidemiological study\.The Lancet Global Health12\(10\),pp\. e1684–e1692\.External Links:[Document](https://dx.doi.org/10.1016/S2214-109X%2824%2900260-2)Cited by:[§3\.7\.1](https://arxiv.org/html/2606.30842#S3.SS7.SSS1.p1.1)\.
- \[13\]E\. Kenah, T\. Britton, M\. E\. Halloran, and I\. M\. Longini Jr\(2016\)Molecular infectious disease epidemiology: survival analysis and algorithms linking phylogenies to transmission trees\.PLoS computational biology12\(4\),pp\. e1004869\.Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.10.7.5.1.1)\.
- \[14\]A\. A\. King, M\. Domenech de Cellès, F\. M\. Magpantay, and P\. Rohani\(2015\)Avoidable errors in the modelling of outbreaks of emerging pathogens, with special reference to ebola\.Proceedings of the Royal Society B: Biological Sciences282\(1806\)\.External Links:[Document](https://dx.doi.org/10.1098/rspb.2015.0347)Cited by:[§1](https://arxiv.org/html/2606.30842#S1.p2.1),[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px3.p1.1)\.
- \[15\]A\. Komakech, S\. L\. M\. Whitmer, J\. Izudi, C\. Kizito, M\. Ninsiima, S\. R\. Ahirirwe, Z\. Kabami, A\. R\. Ario,et al\.\(2024\)Sudan virus disease super\-spreading, uganda, 2022\.BMC Infectious Diseases24,pp\. 520\.External Links:[Document](https://dx.doi.org/10.1186/s12879-024-09391-0)Cited by:[§1](https://arxiv.org/html/2606.30842#S1.p1.1),[§3\.7\.1](https://arxiv.org/html/2606.30842#S3.SS7.SSS1.p1.1)\.
- \[16\]B\. Li, A\. Deng, K\. Li, Y\. Hu, Z\. Li, Y\. Shi, Q\. Xiong, Z\. Liu, Q\. Guo, L\. Zou,et al\.\(2022\)Viral infection and transmission in a large, well\-traced outbreak caused by the sars\-cov\-2 delta variant\.Nature communications13\(1\),pp\. 460\.Cited by:[§3\.10\.1](https://arxiv.org/html/2606.30842#S3.SS10.SSS1.p1.1)\.
- \[17\]V\. P\. Martínez, N\. Di Paola, D\. O\. Alonso, U\. Pérez\-Sautu, C\. M\. Bellomo, A\. A\. Iglesias, R\. M\. Coelho, B\. López, N\. Periolo, P\. A\. Larson,et al\.\(2020\)“Super\-spreaders” and person\-to\-person transmission of andes virus in argentina\.New England Journal of Medicine383\(23\),pp\. 2230–2241\.Cited by:[§1](https://arxiv.org/html/2606.30842#S1.p1.1),[§3\.5\.1](https://arxiv.org/html/2606.30842#S3.SS5.SSS1.Px1),[§5\.3](https://arxiv.org/html/2606.30842#S5.SS3.p1.1)\.
- \[18\]B\. Nikolay, H\. Salje, M\. J\. Hossain, A\.K\.M\. D\. Khan, H\. M\.S\. Sazzad, M\. Rahman, P\. Daszak, E\. S\. Gurley,et al\.\(2019\-05\)Transmission of nipah virus — 14 years of investigations in bangladesh\.New England Journal of Medicine380\(19\),pp\. 1804–1814\.External Links:[Document](https://dx.doi.org/10.1056/NEJMoa1805376),[Link](https://doi.org/10.1056/NEJMoa1805376)Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.5.2.5.1.1),[§3\.2\.1](https://arxiv.org/html/2606.30842#S3.SS2.SSS1.p1.1),[§5\.3](https://arxiv.org/html/2606.30842#S5.SS3.p2.1)\.
- \[19\]I\. Specht, G\. K\. Moreno, T\. Brock\-Fisher, L\. A\. Krasilnikova, B\. A\. Petros, J\. E\. Pekar, M\. Schifferli, B\. Fry, C\. M\. Brown, L\. C\. Madoff, M\. Burns, S\. F\. Schaffner, D\. J\. Park, B\. L\. MacInnis, A\. Ozonoff, P\. Varilly, M\. D\. Mitzenmacher, and P\. C\. Sabeti\(2025\-03\)JUNIPER: reconstructing transmission events from next\-generation sequencing data at scale\.Research Square\.Note:PreprintExternal Links:[Document](https://dx.doi.org/10.21203/rs.3.rs-6264999/v1),[Link](https://doi.org/10.21203/rs.3.rs-6264999/v1)Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.13.10.5.1.1)\.
- \[20\]J\. C\. Taube, P\. B\. Miller, and J\. M\. Drake\(2022\)An open\-access database of infectious disease transmission trees to explore superspreader epidemiology\.PLOS Biology20\(6\),pp\. e3001685\.External Links:[Document](https://dx.doi.org/10.1371/journal.pbio.3001685),[Link](https://doi.org/10.1371/journal.pbio.3001685)Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px3.p1.1)\.
- \[21\]A\. J\. Vickers and E\. B\. Elkin\(2006\)Decision curve analysis: a novel method for evaluating prediction models\.Medical Decision Making26\(6\),pp\. 565–574\.Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px4.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.1.5.1.1)\.
- \[22\]H\. Waddel, K\. Koelle, and M\. S\. Y\. Lau\(2025\)ScITree: scalable bayesian inference of transmission tree from epidemiological and genomic data\.PLOS Computational Biology21\(6\),pp\. e1012657\.External Links:[Document](https://dx.doi.org/10.1371/journal.pcbi.1012657),[Link](https://doi.org/10.1371/journal.pcbi.1012657)Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.12.9.5.1.1)\.
- \[23\]J\. Wallinga and M\. Lipsitch\(2007\)How generation intervals shape the relationship between growth rates and reproductive numbers\.Proceedings of the Royal Society B: Biological Sciences274\(1609\),pp\. 599–604\.Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.5.2.5.1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.6.3.5.1.1)\.
- \[24\]J\. Wallinga and P\. Teunis\(2004\)Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures\.American Journal of Epidemiology160\(6\),pp\. 509–516\.External Links:[Document](https://dx.doi.org/10.1093/aje/kwh255)Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.4.1.5.1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.5.2.5.1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.6.3.5.1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.7.4.5.1.1)\.
- \[25\]M\. Walsh, S\. K\. Srinathan, D\. F\. McAuley, M\. Mrkobrada, O\. Levine, C\. Ribic,et al\.\(2014\)The statistical significance of randomized controlled trial results is frequently fragile: a case for a fragility index\.Journal of Clinical Epidemiology67\(6\),pp\. 622–628\.Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px4.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.14.11.5.1.1)\.
- \[26\]R\. L\. Wasserstein and N\. A\. Lazar\(2016\)The asa’s statement on p\-values: context, process, and purpose\.The American Statistician70\(2\),pp\. 129–133\.Cited by:[§2](https://arxiv.org/html/2606.30842#S2.SS0.SSS0.Px4.p1.1),[Table 1](https://arxiv.org/html/2606.30842#S2.T1.1.14.11.5.1.1)\.