M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision

Abstract

Medical image retrieval is essential for clinical decision-making and translational research, and it depends on discriminative visual representations. Yet current methods remain fragmented, using separate architectures and training strategies for 2D, 3D, and video-based medical data. This modality-specific design hampers scalability and inhibits the development of unified representations. To enable unified learning, we curate a large-scale hybrid-modality dataset of 867,653 medical imaging samples, including 2D X-rays and ultrasounds, RGB endoscopy videos, and 3D CT scans. Leveraging this dataset, we train M3Ret, a unified visual encoder without any modality-specific customization. It learns transferable representations under both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms. Our approach sets a new state of the art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP. More remarkably, strong cross-modal alignment emerges without paired data, and the model generalizes to unseen MRI tasks despite never observing MRI during pretraining, demonstrating that purely visual self-supervision can generalize to unseen modalities. Comprehensive analyses further validate the scalability of our framework across model and data sizes. These findings deliver a promising signal to the medical imaging community, positioning M3Ret as a step toward foundation models for visual SSL in multimodal medical image understanding.
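To make the evaluation setting concrete, the following is a minimal sketch of zero-shot image-to-image retrieval over precomputed embeddings. It is not the authors' implementation: the abstract does not state the similarity metric or the evaluation metric, so cosine similarity and Recall@k are assumptions here, the function names are hypothetical, and the small arrays stand in for features that a frozen encoder would produce.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_top_k(query_emb, gallery_emb, k=5):
    """Return indices and similarities of the k nearest gallery items per query.

    Assumption: cosine similarity is the ranking score (not stated in the abstract).
    """
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    g = l2_normalize(np.asarray(gallery_emb, dtype=np.float64))
    sims = q @ g.T  # shape: (num_queries, num_gallery)
    top_idx = np.argsort(-sims, axis=1, kind="stable")[:, :k]
    top_sims = np.take_along_axis(sims, top_idx, axis=1)
    return top_idx, top_sims


def recall_at_k(top_idx, query_labels, gallery_labels):
    """Fraction of queries with at least one same-label item among the retrieved."""
    query_labels = np.asarray(query_labels)
    gallery_labels = np.asarray(gallery_labels)
    hits = (gallery_labels[top_idx] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())


# Placeholder embeddings; in practice these would come from a frozen encoder
# applied to query and gallery images, with no fine-tuning (zero-shot).
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[2.0, 0.1], [0.1, 3.0]])
gallery_labels = np.array([0, 1, 0])
query_labels = np.array([0, 1])

top_idx, top_sims = retrieve_top_k(queries, gallery, k=2)
print(top_idx)
print(recall_at_k(top_idx[:, :1], query_labels, gallery_labels))
```

Because the encoder is never fine-tuned on the retrieval labels, the labels enter only at scoring time, which is what makes the setting zero-shot.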
