Assessing Vascular Invasion in Pancreatic Ductal Adenocarcinoma: The PDACVI Benchmark

Abstract

Surgical resection remains the only potentially curative treatment for pancreatic ductal adenocarcinoma (PDAC), and eligibility for surgery depends on accurate assessment of vascular invasion (VI), i.e., tumor extension into adjacent critical vessels. Despite its importance for preoperative staging and surgical planning, computational VI assessment remains underexplored. Two major obstacles are the lack of public datasets and the diagnostic ambiguity at the tumor-vessel interface, which leads to substantial inter-rater variability even among expert radiologists. To address these limitations, we introduce the CURVAS-PDACVI Dataset and Challenge, an open benchmark for uncertainty-aware AI in PDAC staging built on a densely annotated dataset with five independent expert annotations per scan. We also propose a multi-metric evaluation framework that extends beyond spatial overlap to include probabilistic calibration and VI assessment. An evaluation of six state-of-the-art methods shows that strong global volumetric overlap does not necessarily translate into reliable performance at clinically critical tumor-vessel interfaces. In particular, methods optimized for binary segmentation perform competitively on average overlap metrics but often degrade in high-complexity cases with low expert consensus, either collapsing in volume or overextending at uncertain boundaries. In contrast, methods that model inter-rater disagreement produce better-calibrated probabilistic maps and are more robust in these ambiguous cases. The benchmark highlights the limitations of volumetric accuracy as a proxy for localized surgical utility and motivates uncertainty-aware probabilistic models for preoperative decision-making.
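To illustrate how spatial overlap and probabilistic calibration can diverge, the following is a minimal sketch and not the benchmark's actual metric code. It assumes that the five expert masks are averaged into a per-voxel agreement fraction, that overlap is scored with Dice against a majority-vote mask, and that calibration is scored with a binned expected calibration error taken against the agreement fraction. The function names, the 0.5 threshold, and the bin count are illustrative choices, not details taken from the challenge.

```python
import numpy as np


def soft_consensus(annotations):
    """Average K binary masks (shape (K, ...)) into a per-voxel agreement fraction."""
    return np.asarray(annotations, dtype=float).mean(axis=0)


def dice(pred_mask, ref_mask, eps=1e-8):
    """Dice overlap between two binary masks (eps avoids 0/0 on empty masks)."""
    pred = np.asarray(pred_mask, dtype=bool)
    ref = np.asarray(ref_mask, dtype=bool)
    intersection = np.logical_and(pred, ref).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + ref.sum() + eps))


def expected_calibration_error(probs, targets, n_bins=10):
    """Binned ECE: weighted mean gap between predicted probability and target.

    Assumption: `targets` may be soft (rater agreement fractions) rather
    than hard 0/1 labels.
    """
    probs = np.asarray(probs, dtype=float).ravel()
    targets = np.asarray(targets, dtype=float).ravel()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0 .. n_bins - 1.
    bin_ids = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(probs[in_bin].mean() - targets[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)


# Toy example: five hypothetical raters labelling four voxels.
raters = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
])
consensus = soft_consensus(raters)   # per-voxel agreement fraction
majority = consensus >= 0.5          # majority-vote reference mask

pred_probs = np.array([0.95, 0.65, 0.25, 0.05])  # hypothetical model output
overlap = dice(pred_probs >= 0.5, majority)
ece = expected_calibration_error(pred_probs, consensus)
print(f"Dice vs. majority vote: {overlap:.3f}, ECE vs. consensus: {ece:.3f}")
```

In this toy case the thresholded prediction matches the majority vote exactly, yet the calibration error is nonzero. It is a small instance of the gap between overlap and calibration that the benchmark examines.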