MedDINOv3: How to adapt vision foundation models for medical image segmentation?

Abstract

Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific and generalize poorly across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbones of most foundation models still underperform specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design a simple architecture with multi-scale token aggregation. We then perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at https://github.com/ricklisz/MedDINOv3.
