Does DINOv3 Set a New Medical Vision Standard?

Abstract

The advent of large-scale vision foundation models, pre-trained on diverse natural images, has marked a paradigm shift in computer vision. However, how well the efficacy of frontier vision foundation models transfers to specialized domains such as medical imaging remains an open question. This report investigates whether DINOv3, a state-of-the-art self-supervised vision transformer (ViT) with strong capabilities in dense prediction tasks, can directly serve as a powerful, unified encoder for medical vision tasks without domain-specific pre-training. To answer this, we benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation, on a wide range of medical imaging modalities. We systematically analyze its scalability by varying model size and input image resolution. Our findings reveal that DINOv3 shows impressive performance and establishes a formidable new baseline. Remarkably, it can even outperform medical-specific foundation models such as BiomedCLIP and CT-Net on several tasks, despite being trained solely on natural images. However, we identify clear limitations: the model's features degrade in scenarios requiring deep domain specialization, such as Whole-Slide Pathological Images (WSIs), Electron Microscopy (EM), and Positron Emission Tomography (PET). Furthermore, we observe that DINOv3 does not consistently obey scaling laws in the medical domain; performance does not reliably increase with larger models or finer feature resolutions, and scaling behavior varies across tasks. Ultimately, our work establishes DINOv3 as a strong baseline whose powerful visual features can serve as a robust prior for multiple complex medical tasks. This opens promising future directions, such as leveraging its features to enforce multiview consistency in 3D reconstruction.
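The observation about inconsistent scaling can be made concrete with a small check. The sketch below is not the report's evaluation code. It assumes only that, for each task, one metric value is available per model size (or per input resolution), ordered from smallest to largest. The function name `scaling_trend`, the task names, and the numbers are hypothetical placeholders chosen for illustration.

```python
from typing import Sequence


def scaling_trend(scores: Sequence[float], tol: float = 0.0) -> str:
    """Classify how a metric changes as model size (or resolution) grows.

    `scores` must be ordered from the smallest to the largest setting.
    Returns "increasing", "decreasing", "flat", or "mixed".
    """
    if len(scores) < 2:
        raise ValueError("need at least two settings to compare")
    # Differences between consecutive settings.
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    ups = any(d > tol for d in deltas)
    downs = any(d < -tol for d in deltas)
    if ups and downs:
        return "mixed"
    if ups:
        return "increasing"
    if downs:
        return "decreasing"
    return "flat"


# Placeholder numbers for illustration only; they are NOT results from the report.
example = {
    "task_a": [0.70, 0.74, 0.79],  # improves with each larger setting
    "task_b": [0.81, 0.84, 0.80],  # peaks at the middle setting
}
trends = {task: scaling_trend(vals) for task, vals in example.items()}
```

A task labelled "mixed" or "decreasing" is one where a larger model or finer resolution did not help, which is the kind of behavior the abstract describes. The `tol` parameter lets small fluctuations be treated as flat rather than as a real change.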
