TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation

Abstract

Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, so their probabilistic outputs can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), defined as the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. On the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark, it consistently improves calibration metrics over standard approaches.
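
To make the idea of calibrating to the MHR concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say which calibrator TwinTrack uses, so this example assumes a Platt-style affine map on the logits, fitted by gradient descent on a soft-label cross-entropy. All function names, the learning rate, and the step count are hypothetical choices for illustration.

```python
import numpy as np


def mean_human_response(masks):
    """Per-voxel fraction of raters labeling tumor.

    masks: array of shape (n_raters, ...) with binary entries.
    """
    return np.asarray(masks, dtype=float).mean(axis=0)


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def _logit(p, eps=1e-6):
    p = np.clip(p, eps, 1.0 - eps)  # avoid infinite logits at 0 or 1
    return np.log(p / (1.0 - p))


def fit_platt_to_mhr(probs, mhr, lr=0.1, steps=2000):
    """Fit (a, b) so that sigmoid(a * logit(p) + b) approximates the MHR.

    Assumed calibrator (not taken from the paper): affine map on logits,
    trained with cross-entropy against soft MHR targets. The learning
    rate may need tuning when probabilities are very close to 0 or 1.
    """
    z = _logit(np.ravel(probs))
    y = np.ravel(mhr).astype(float)
    a, b = 1.0, 0.0  # start from the identity map
    for _ in range(steps):
        q = _sigmoid(a * z + b)
        grad = q - y  # derivative of cross-entropy w.r.t. the logit
        a -= lr * np.mean(grad * z)
        b -= lr * np.mean(grad)
    return a, b


def apply_calibration(probs, a, b):
    """Map ensemble probabilities to calibrated MHR estimates."""
    return _sigmoid(a * _logit(probs) + b)
```

In use, one would compute `mean_human_response` over the rater masks of the small calibration set, fit `(a, b)` against the ensemble probabilities on the same voxels, and then call `apply_calibration` on new predictions. The output can then be read as an estimate of the fraction of annotators who would label each voxel as tumor.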