MedDINOv3: How to adapt vision foundation models for medical image segmentation?
September 2, 2025
Authors: Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang
cs.AI
Abstract
Accurate segmentation of organs and tumors in CT and MRI scans is essential
for diagnosis, treatment planning, and disease monitoring. While deep learning
has advanced automated segmentation, most models remain task-specific, lacking
generalizability across modalities and institutions. Vision foundation models
(FMs) pretrained on billion-scale natural images offer powerful and
transferable representations. However, adapting them to medical imaging faces
two key challenges: (1) the ViT backbones of most foundation models still
underperform specialized CNNs on medical image segmentation, and (2) the large
domain gap between natural and medical images limits transferability. We
introduce MedDINOv3, a simple and effective framework for adapting
DINOv3 to medical segmentation. We first revisit plain ViTs and design a
streamlined, effective architecture with multi-scale token aggregation. Then, we perform
domain-adaptive pretraining on CT-3M, a curated collection of 3.87M
axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense
features. MedDINOv3 matches or exceeds state-of-the-art performance across four
segmentation benchmarks, demonstrating the potential of vision foundation
models as unified backbones for medical image segmentation. The code is
available at https://github.com/ricklisz/MedDINOv3.
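
The architectural idea named in the abstract, multi-scale token aggregation over a plain ViT, can be illustrated with a short sketch. The PyTorch code below is one plausible realization rather than the authors' implementation: it assumes a timm-style ViT exposing `patch_embed` and `blocks`, omits cls-token and positional-embedding handling, and the tap depths `(3, 6, 9, 12)` are illustrative choices.

```python
import torch
import torch.nn as nn

class MultiScaleTokenAggregator(nn.Module):
    """Runs a plain ViT and collects patch tokens at several depths,
    reshaping each set into a 2D feature map for a segmentation decoder."""

    def __init__(self, vit: nn.Module, take_indices=(3, 6, 9, 12)):
        super().__init__()
        self.vit = vit
        self.take_indices = set(take_indices)

    def forward(self, x: torch.Tensor):
        B = x.shape[0]
        tokens = self.vit.patch_embed(x)  # (B, N, C); cls/pos-embed handling omitted
        feats = []
        for depth, block in enumerate(self.vit.blocks, start=1):
            tokens = block(tokens)
            if depth in self.take_indices:
                h = w = int(tokens.shape[1] ** 0.5)  # assumes a square patch grid
                # (B, N, C) -> (B, C, h, w): tokens become a spatial feature map
                feats.append(tokens.transpose(1, 2).reshape(B, -1, h, w))
        return feats  # list of multi-depth features for the decoder
```

A lightweight decoder can then upsample and fuse these per-depth feature maps into per-pixel predictions.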
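The domain-adaptive pretraining on CT-3M follows a multi-stage DINOv3 recipe. The sketch below shows only the self-distillation core shared by the DINO family (an EMA teacher producing sharpened targets for a student network); the multi-stage schedule and DINOv3-specific components such as Gram anchoring are omitted, and the helper names, temperatures, and momentum value are all assumptions, not values from the paper.

```python
import copy
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, s_temp=0.1, t_temp=0.04):
    """Cross-entropy between the sharpened teacher distribution and the student's."""
    targets = F.softmax(teacher_logits / t_temp, dim=-1).detach()
    log_probs = F.log_softmax(student_logits / s_temp, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher weights track an exponential moving average of the student's."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Usage sketch (names hypothetical): two augmented crops of one axial CT slice.
# student = build_vit()              # trainable network
# teacher = copy.deepcopy(student)   # gradient-free, EMA-updated copy
# loss = dino_loss(student(local_crop), teacher(global_crop))
# loss.backward(); optimizer.step(); optimizer.zero_grad()
# ema_update(teacher, student)
```

The EMA teacher stabilizes the targets across training steps, which is what lets this objective learn dense features without any segmentation labels.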