M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision
September 1, 2025
Authors: Che Liu, Zheng Jiang, Chengyu Fang, Heng Guo, Yan-Jie Zhou, Jiaqi Qu, Le Lu, Minfeng Xu
cs.AI
Abstract
Medical image retrieval is essential for clinical decision-making and
translational research, and it depends on discriminative visual representations.
Yet current methods remain fragmented, using separate architectures and
training strategies for 2D, 3D, and video-based medical data. This
modality-specific design hampers scalability and inhibits the development of
unified representations. To enable unified learning, we curate a large-scale
hybrid-modality dataset comprising 867,653 medical imaging samples, including
2D X-rays and ultrasounds, RGB endoscopy videos, and 3D CT scans. Leveraging
this dataset, we train M3Ret, a unified visual encoder without any
modality-specific customization. It successfully learns transferable
representations using both generative (MAE) and contrastive (SimDINO)
self-supervised learning (SSL) paradigms. Our approach sets a new
state of the art in zero-shot image-to-image retrieval across all individual
modalities, surpassing strong baselines such as DINOv3 and the text-supervised
BMC-CLIP. More remarkably, strong cross-modal alignment emerges without paired
data, and the model generalizes to unseen MRI tasks, despite never observing
MRI during pretraining, demonstrating the generalizability of purely visual
self-supervision to unseen modalities. Comprehensive analyses further validate
the scalability of our framework across model and data sizes. These findings
deliver a promising signal to the medical imaging community, positioning M3Ret
as a step toward foundation models for visual SSL in multimodal medical image
understanding.
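
To make the retrieval setting concrete, the following is a minimal sketch of zero-shot image-to-image retrieval over precomputed embeddings. It assumes each image has already been mapped to a fixed-length vector by a frozen encoder and that ranking uses cosine similarity; the function names (`l2_normalize`, `retrieve_top_k`) and the toy vectors are hypothetical and illustrative only, not the authors' code or evaluation protocol.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def retrieve_top_k(query, gallery, k=3):
    """Return (index, cosine similarity) pairs for the k gallery items closest to the query."""
    q = l2_normalize(query)
    scored = []
    for idx, emb in enumerate(gallery):
        g = l2_normalize(emb)
        scored.append((idx, sum(a * b for a, b in zip(q, g))))
    # Highest similarity first; no training or fine-tuning is involved (zero-shot).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings standing in for the output of a frozen encoder (hypothetical values).
gallery = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]
top = retrieve_top_k(query, gallery, k=2)
print(top)
```

In this sketch the encoder is never updated for the retrieval task: the gallery is ranked purely by embedding similarity, which is what makes the setting zero-shot. A real pipeline would typically batch the similarity computation with an array library rather than loop in pure Python.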