
Does DINOv3 Set a New Medical Vision Standard?

September 8, 2025
Authors: Che Liu, Yinda Chen, Haoyuan Shi, Jinpeng Lu, Bailiang Jian, Jiazhen Pan, Linghan Cai, Jiayi Wang, Yundi Zhang, Jun Li, Cosmin I. Bercea, Cheng Ouyang, Chen Chen, Zhiwei Xiong, Benedikt Wiestler, Christian Wachinger, Daniel Rueckert, Wenjia Bai, Rossella Arcucci
cs.AI

Abstract

The advent of large-scale vision foundation models, pre-trained on diverse natural images, has marked a paradigm shift in computer vision. However, how well the efficacy of these frontier vision foundation models transfers to specialized domains such as medical imaging remains an open question. This report investigates whether DINOv3, a state-of-the-art self-supervised vision transformer (ViT) with strong capabilities in dense prediction tasks, can directly serve as a powerful, unified encoder for medical vision tasks without domain-specific pre-training. To answer this, we benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation, on a wide range of medical imaging modalities. We systematically analyze its scalability by varying model sizes and input image resolutions. Our findings reveal that DINOv3 shows impressive performance and establishes a formidable new baseline. Remarkably, it can even outperform medical-specific foundation models such as BiomedCLIP and CT-Net on several tasks, despite being trained solely on natural images. However, we identify clear limitations: the model's features degrade in scenarios requiring deep domain specialization, such as whole-slide pathology images (WSIs), electron microscopy (EM), and positron emission tomography (PET). Furthermore, we observe that DINOv3 does not consistently obey scaling laws in the medical domain; performance does not reliably improve with larger models or finer feature resolutions, showing diverse scaling behaviors across tasks. Ultimately, our work establishes DINOv3 as a strong baseline whose powerful visual features can serve as a robust prior for multiple complex medical tasks. This opens promising future directions, such as leveraging its features to enforce multi-view consistency in 3D reconstruction.
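To make the evaluation setup concrete, the sketch below shows one common way to use a frozen DINOv3 backbone with a lightweight linear probe for a 2D medical classification task, in the spirit of the "unified encoder without domain-specific pre-training" setting described above. This is not the authors' code: the Hugging Face checkpoint name is an assumption, and the probe head, task, and class count are illustrative placeholders.

```python
# Minimal sketch (assumed setup, not the paper's implementation):
# frozen DINOv3 features + a trainable linear probe for 2D classification.
import torch
from transformers import AutoImageProcessor, AutoModel

MODEL_ID = "facebook/dinov3-vitb16-pretrain-lvd1689m"  # assumed checkpoint name

processor = AutoImageProcessor.from_pretrained(MODEL_ID)
backbone = AutoModel.from_pretrained(MODEL_ID)
backbone.eval()                          # encoder stays frozen
for p in backbone.parameters():
    p.requires_grad = False

NUM_CLASSES = 2                          # e.g. a binary X-ray finding (placeholder)
probe = torch.nn.Linear(backbone.config.hidden_size, NUM_CLASSES)

def classify(images):
    """images: list of PIL images (2D slices or radiographs)."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():                # no gradients through the backbone
        tokens = backbone(**inputs).last_hidden_state  # (B, num_tokens, dim)
    cls_token = tokens[:, 0]             # global image representation
    return probe(cls_token)              # only the probe is trained
```

Under this kind of protocol, scaling studies would swap in larger DINOv3 variants or higher input resolutions while keeping the probe fixed, which is how the diverse scaling behaviors noted in the abstract could be observed.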