MedDINOv3: How to adapt vision foundation models for medical image segmentation?

Abstract

Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific and do not generalize well across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbones of most foundation models still underperform specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design a simple architecture with multi-scale token aggregation. Then, we perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at https://github.com/ricklisz/MedDINOv3.
