Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation
May 18, 2026
Authors: Kirscher Tristan, Bujotzek Markus, Kirchhoff Yannick, Rokuss Maximilian, Isensee Fabian, Kahl Kim-Celine, Kovacs Balint, Maier-Hein Klaus
cs.AI
Abstract
Ensemble disagreement is widely used as a proxy for epistemic uncertainty in medical image segmentation. In practice, many studies form ensembles via K-fold cross-validation (CV), yet refer to them as "deep ensembles" (DE). Because CV members are trained on different data subsets, their disagreement mixes seed-driven variability with data-exposure effects, which can change how uncertainty should be interpreted. We audit recent segmentation uncertainty studies and find that mismatches between terminology and implementation are common. We then compare a standard 5-fold CV ensemble to a 5-member DE (fixed training set, different random seeds) under otherwise identical configurations on three multi-rater segmentation datasets spanning three modalities. We evaluate uncertainty for calibration, failure detection, ambiguity modeling, and robustness under distribution shift. DEs match segmentation accuracy while improving calibration and failure detection, whereas CV ensembles sometimes correlate more strongly with inter-rater variability on the studied datasets. Thus, ensemble construction should be chosen to match the research question: DEs for reliability-oriented use (e.g., selective referral or failure detection) and CV ensembles as a proxy for ambiguity. We provide a lightweight nnU-Net modification enabling DE training within the default pipeline.
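The difference between the two ensemble constructions can be illustrated with a minimal sketch. It is not the authors' nnU-Net modification and uses no nnU-Net API. The function names (`cv_ensemble_plan`, `deep_ensemble_plan`, `disagreement`), the interleaved fold assignment, and the use of per-voxel variance as the disagreement measure are all assumptions made for illustration.

```python
import random
from statistics import pvariance


def cv_ensemble_plan(case_ids, k=5, shuffle_seed=0):
    """Hypothetical helper: one training plan per member of a K-fold CV ensemble.

    Each member leaves out a different fold, so members differ in both
    the data they see and their random seed.
    """
    ids = list(case_ids)
    random.Random(shuffle_seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # interleaved split (an assumption)
    plan = []
    for i in range(k):
        held_out = set(folds[i])
        train = sorted(c for c in ids if c not in held_out)
        plan.append({"train_ids": train, "seed": i})
    return plan


def deep_ensemble_plan(train_ids, members=5):
    """Hypothetical helper: one training plan per member of a deep ensemble.

    Every member uses the same fixed training set; only the seed changes.
    """
    fixed = sorted(train_ids)
    return [{"train_ids": fixed, "seed": s} for s in range(members)]


def disagreement(member_probs):
    """Population variance of the members' foreground probabilities at one voxel."""
    return pvariance(member_probs)


cases = [f"case_{i:02d}" for i in range(10)]
cv_plan = cv_ensemble_plan(cases)
de_plan = deep_ensemble_plan(cases)
```

In `cv_plan`, every member trains on a different subset of the cases, so any disagreement between members reflects both seed and data exposure. In `de_plan`, the training set is identical across members, so disagreement can only come from the seed.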