TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation
Summary
TwinTrack is a post-hoc calibration framework for pancreatic cancer segmentation that aligns ensemble model probabilities with the empirical mean human response across multiple annotators, making the probabilities directly interpretable and improving calibration metrics on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.
Source: https://huggingface.co/papers/2604.15950
Abstract
The TwinTrack framework addresses ambiguity in pancreatic cancer segmentation by post-hoc calibrating ensemble probabilities to the empirical mean human response, improving calibration metrics on a multi-rater benchmark.
Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.
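To make the idea of calibrating to the MHR concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not specify the calibration map, so histogram binning is swapped in here as a simple stand-in: ensemble probabilities on a small calibration set are grouped into equal-width bins, and each bin is mapped to the average MHR of its voxels. All function names (`mean_human_response`, `fit_histogram_calibrator`, `apply_histogram_calibrator`) and the flattened-list representation of voxels are assumptions made for illustration.

```python
from statistics import fmean


def mean_human_response(rater_masks):
    """Per-voxel fraction of raters that labeled the voxel as tumor.

    rater_masks: list of R flattened binary masks (0/1), all the same length.
    """
    return [fmean(votes) for votes in zip(*rater_masks)]


def _bin_index(p, n_bins):
    # Map p in [0, 1] to an equal-width bin; p == 1.0 goes into the last bin.
    return min(int(p * n_bins), n_bins - 1)


def fit_histogram_calibrator(probs, mhr, n_bins=10):
    """Histogram binning (an assumed stand-in for the paper's procedure).

    probs: ensemble tumor probabilities on the calibration voxels.
    mhr:   mean human response for the same voxels.
    Returns one calibrated value per bin: the average MHR of its voxels.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, target in zip(probs, mhr):
        b = _bin_index(p, n_bins)
        sums[b] += target
        counts[b] += 1
    # Empty bins fall back to the bin center, leaving those probabilities
    # roughly unchanged.
    return [
        sums[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]


def apply_histogram_calibrator(probs, bin_values):
    """Replace each probability with the calibrated value of its bin."""
    n_bins = len(bin_values)
    return [bin_values[_bin_index(p, n_bins)] for p in probs]


# Toy calibration set: 3 hypothetical raters, 4 voxels.
masks = [
    [1, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
mhr = mean_human_response(masks)
bin_values = fit_histogram_calibrator([0.2, 0.3, 0.8, 0.9], mhr, n_bins=2)
calibrated = apply_histogram_calibrator([0.1, 0.7, 1.0], bin_values)
```

Under this sketch, a calibrated value of about 0.83 for a voxel reads as "roughly five in six annotators would be expected to label this voxel as tumor", which is the interpretation the abstract describes. A real pipeline would likely operate on 3D arrays and might use a smoother calibration map; those choices are left open here.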