TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation

Hugging Face Daily Papers 04/17/26, 12:00 AM Papers

medical-imaging segmentation calibration deep-learning uncertainty multi-rater cancer-detection

Summary

TwinTrack is a post-hoc calibration framework for pancreatic cancer segmentation that aligns ensemble model probabilities with the empirical mean human response across multiple annotators, improving interpretability and calibration metrics on multi-rater benchmarks.

Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.
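To make the idea of calibrating to the MHR concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not say which calibration map TwinTrack uses, so plain histogram binning is swapped in purely as an illustration. The function names (`mean_human_response`, `fit_binned_calibrator`, `apply_binned_calibrator`) and the toy arrays are hypothetical.

```python
import numpy as np


def mean_human_response(rater_masks):
    """Fraction of raters labeling each voxel as tumor.

    rater_masks: binary array of shape (n_raters, ...), one mask per rater.
    """
    return np.asarray(rater_masks, dtype=float).mean(axis=0)


def fit_binned_calibrator(probs, mhr, n_bins=10):
    """Histogram binning (an assumed stand-in, not the paper's procedure).

    Each probability bin is mapped to the mean MHR of the calibration
    voxels whose ensemble probability falls in that bin.
    """
    probs = np.ravel(np.asarray(probs, dtype=float))
    mhr = np.ravel(np.asarray(mhr, dtype=float))
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so indices run from 0 to n_bins - 1.
    idx = np.digitize(probs, edges[1:-1])
    sums = np.bincount(idx, weights=mhr, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Empty bins fall back to the bin center, i.e. an identity-like map.
    bin_values = np.where(counts > 0, sums / np.maximum(counts, 1), centers)
    return edges, bin_values


def apply_binned_calibrator(probs, edges, bin_values):
    """Replace each probability with the calibrated value of its bin."""
    probs = np.asarray(probs, dtype=float)
    return bin_values[np.digitize(probs, edges[1:-1])]


# Toy calibration set: 3 raters, 4 voxels (made-up values).
masks = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [1, 1, 1, 0]])
ensemble_probs = np.array([0.95, 0.55, 0.52, 0.05])

mhr = mean_human_response(masks)
edges, bin_values = fit_binned_calibrator(ensemble_probs, mhr, n_bins=2)
calibrated = apply_binned_calibrator(ensemble_probs, edges, bin_values)
```

After fitting, a calibrated value is read as the expected share of annotators who would mark that voxel as tumor, which is the interpretation the abstract describes. In practice the calibrator would be fitted on a held-out multi-rater calibration set and then applied to new cases.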

Cached at: 04/21/26, 07:21 AM

Source: https://huggingface.co/papers/2604.15950

Abstract

The TwinTrack framework addresses ambiguity in pancreatic cancer segmentation through post-hoc calibration of ensemble probabilities to the empirical mean human response, improving calibration metrics on multi-rater benchmarks.

Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.

Similar Articles

When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

New AI model spots pancreatic cancer up to 3 years earlier than human doctors in test

@Mayank_022: I tested @huggingface ml-intern, given the prompt "Fine-tune a Segment Anything Model (SAM) on a useful medical dataset…

Using GPT-4o reasoning to transform cancer care

Reciprocal Co-Training (RCT): Coupling Gradient-Based and Non-Differentiable Models via Reinforcement Learning