M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision
September 1, 2025
Authors: Che Liu, Zheng Jiang, Chengyu Fang, Heng Guo, Yan-Jie Zhou, Jiaqi Qu, Le Lu, Minfeng Xu
cs.AI
Abstract
Medical image retrieval is essential for clinical decision-making and
translational research, as it depends on discriminative visual representations.
Yet current methods remain fragmented, relying on separate architectures and
training strategies for 2D, 3D, and video-based medical data. This
modality-specific design hampers scalability and inhibits the development of
unified representations. To enable unified learning, we curate a large-scale
hybrid-modality dataset comprising 867,653 medical imaging samples, including
2D X-rays and ultrasounds, RGB endoscopy videos, and 3D CT scans. Leveraging
this dataset, we train M3Ret, a unified visual encoder without any
modality-specific customization. It successfully learns transferable
representations using both generative (MAE) and contrastive (SimDINO)
self-supervised learning (SSL) paradigms. Our approach sets a new
state-of-the-art in zero-shot image-to-image retrieval across all individual
modalities, surpassing strong baselines such as DINOv3 and the text-supervised
BMC-CLIP. More remarkably, strong cross-modal alignment emerges without paired
data, and the model generalizes to unseen MRI tasks, despite never observing
MRI during pretraining, demonstrating the generalizability of purely visual
self-supervision to unseen modalities. Comprehensive analyses further validate
the scalability of our framework across model and data sizes. These findings
deliver a promising signal to the medical imaging community, positioning M3Ret
as a step toward foundation models for visual SSL in multimodal medical image
understanding.
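The zero-shot image-to-image retrieval described above reduces, at inference time, to ranking gallery images by similarity between their embeddings from a frozen encoder. The sketch below illustrates this with cosine similarity over random vectors standing in for encoder outputs; the function name `retrieve` and the toy dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def retrieve(query_emb, gallery_embs, top_k=3):
    """Rank gallery images by cosine similarity to a query embedding.

    Illustrative sketch: in practice, embeddings would come from a
    frozen SSL-pretrained encoder such as the one the paper describes.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                       # cosine similarities, shape (N,)
    order = np.argsort(-sims)[:top_k]  # indices of the top-k nearest images
    return order, sims[order]

# Toy example with random "embeddings" in place of real encoder features.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 512))              # 100 images, 512-dim features
query = gallery[42] + 0.01 * rng.normal(size=512)  # near-duplicate of image 42
idx, scores = retrieve(query, gallery)
print(idx[0])  # nearest neighbor is image 42
```

No training or paired supervision is involved at this stage, which is what makes the retrieval "zero-shot": all the work is done by the quality of the pretrained representations.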