M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision

Abstract

Medical image retrieval is essential for clinical decision-making and translational research, and it depends on discriminative visual representations. Yet current methods remain fragmented, using separate architectures and training strategies for 2D, 3D, and video-based medical data. This modality-specific design hampers scalability and inhibits the development of unified representations. To enable unified learning, we curate a large-scale hybrid-modality dataset of 867,653 medical imaging samples, including 2D X-rays and ultrasounds, RGB endoscopy videos, and 3D CT scans. Leveraging this dataset, we train M3Ret, a unified visual encoder without any modality-specific customization. It learns transferable representations using both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms. Our approach sets a new state of the art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP. More remarkably, strong cross-modal alignment emerges without paired data, and the model generalizes to unseen MRI tasks despite never observing MRI during pretraining, which demonstrates that purely visual self-supervision can generalize to unseen modalities. Comprehensive analyses further validate the scalability of our framework across model and data sizes. These findings deliver a promising signal to the medical imaging community, positioning M3Ret as a step toward foundation models for visual SSL in multimodal medical image understanding.
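As a minimal illustration of what zero-shot image-to-image retrieval involves, the sketch below ranks gallery items by similarity to each query embedding and scores the ranking with Recall@k. It is not the authors' evaluation code: the abstract does not state the similarity measure or metric, so cosine similarity and Recall@k are assumptions here, the function names are hypothetical, and the 2-D vectors are toy stand-ins for the features a frozen encoder such as M3Ret would produce.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_top_k(query_emb: np.ndarray, gallery_emb: np.ndarray, k: int) -> np.ndarray:
    """Return, for each query row, the indices of the k most similar gallery rows."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    # Sort by descending similarity; the stable sort keeps ties in gallery order.
    return np.argsort(-sims, axis=1, kind="stable")[:, :k]


def recall_at_k(top_k: np.ndarray, query_labels: np.ndarray, gallery_labels: np.ndarray) -> float:
    """Fraction of queries with at least one same-label item among their top-k hits."""
    hits = gallery_labels[top_k] == query_labels[:, None]
    return float(hits.any(axis=1).mean())


# Toy stand-ins for encoder outputs (assumed values, not real image features).
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
gallery_labels = np.array([0, 0, 1, 1])
queries = np.array([[0.8, 0.2], [0.2, 0.8]])
query_labels = np.array([0, 1])

top1 = retrieve_top_k(queries, gallery, k=1)
score = recall_at_k(top1, query_labels, gallery_labels)
```

No labels or fine-tuning enter the ranking step itself, which is what makes the setting zero-shot: labels are used only afterwards to score the retrieved neighbors.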
