MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation

Abstract

Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight the local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation.
