TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation

Summary

TwinTrack is a post-hoc calibration framework for pancreatic cancer segmentation that aligns ensemble model probabilities with the empirical mean human response across multiple annotators, improving interpretability and calibration metrics on multi-rater benchmarks.

Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.
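
To make the idea of calibrating to the mean human response concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say which calibration map TwinTrack uses, so a two-parameter logistic (Platt-style) rescaling fitted by gradient descent is swapped in as an assumption. All function names (`mean_human_response`, `fit_logistic_calibrator`, `apply_calibrator`, `soft_cross_entropy`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def mean_human_response(masks):
    """Fraction of raters labeling each voxel as tumor.

    masks: binary array of shape (n_raters, ...), one mask per rater.
    """
    masks = np.asarray(masks, dtype=float)
    return masks.mean(axis=0)


def _logit(p, eps=1e-6):
    # Clip to avoid infinities at exactly 0 or 1.
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def soft_cross_entropy(probs, targets, eps=1e-6):
    """Mean binary cross-entropy against soft targets in [0, 1]."""
    p = np.clip(probs, eps, 1.0 - eps)
    loss = -(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))
    return float(np.mean(loss))


def fit_logistic_calibrator(probs, mhr, lr=0.1, steps=3000):
    """Fit (a, b) so that sigmoid(a * logit(p) + b) tracks the MHR.

    ASSUMPTION: Platt-style scaling stands in for the paper's
    (unspecified) calibration procedure. Plain gradient descent on the
    soft-target cross-entropy, starting from the identity map.
    """
    z = _logit(np.ravel(probs))
    t = np.ravel(mhr)
    a, b = 1.0, 0.0
    for _ in range(steps):
        residual = _sigmoid(a * z + b) - t  # gradient w.r.t. the logit
        a -= lr * np.mean(residual * z)
        b -= lr * np.mean(residual)
    return a, b


def apply_calibrator(probs, a, b):
    """Map raw ensemble probabilities to calibrated ones."""
    return _sigmoid(a * _logit(probs) + b)
```

In use, one would compute the MHR on a small multi-rater calibration set, fit `(a, b)` on the ensemble probabilities for those voxels, and then apply the fitted map to new predictions, so that an output of 0.67 reads as "about two of three annotators would mark this voxel as tumor".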

Cached at: 04/21/26, 07:21 AM

Source: https://huggingface.co/papers/2604.15950
