Does DINOv3 Set a New Medical Vision Standard?

Abstract

The advent of large-scale vision foundation models, pre-trained on diverse natural images, has marked a paradigm shift in computer vision. However, how well the efficacy of frontier vision foundation models transfers to specialized domains such as medical imaging remains an open question. This report investigates whether DINOv3, a state-of-the-art self-supervised vision transformer (ViT) with strong capabilities in dense prediction tasks, can directly serve as a powerful, unified encoder for medical vision tasks without domain-specific pre-training. To answer this, we benchmark DINOv3 on common medical vision tasks, including 2D/3D classification and segmentation, across a wide range of medical imaging modalities. We systematically analyze its scalability by varying model size and input image resolution. Our findings reveal that DINOv3 shows impressive performance and establishes a formidable new baseline. Remarkably, it can even outperform medical-specific foundation models such as BiomedCLIP and CT-Net on several tasks, despite being trained solely on natural images. However, we identify clear limitations: the model's features degrade in scenarios requiring deep domain specialization, such as Whole-Slide Pathological Images (WSIs), Electron Microscopy (EM), and Positron Emission Tomography (PET). Furthermore, we observe that DINOv3 does not consistently obey scaling laws in the medical domain: performance does not reliably increase with larger models or finer feature resolutions, and scaling behavior varies across tasks. Ultimately, our work establishes DINOv3 as a strong baseline whose powerful visual features can serve as a robust prior for multiple complex medical tasks. This opens promising future directions, such as leveraging its features to enforce multiview consistency in 3D reconstruction.
