

MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation

February 23, 2026
Authors: Taha Koleilat, Hojat Asgariandehkordi, Omid Nejati Manzari, Berardino Barile, Yiming Xiao, Hassan Rivaz
cs.AI

Abstract

Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation.
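The following is a minimal, hypothetical sketch (in PyTorch) of the two ideas named in the abstract: probabilistic cross-modal attention between CLIP patch embeddings and text tokens, and a soft patch-level contrastive loss. It is not the authors' implementation; all class names, dimensions, and the use of a per-patch Gaussian with the reparameterization trick are illustrative assumptions based only on the abstract.

```python
# Hypothetical sketch, not the authors' code: probabilistic cross-modal attention
# between CLIP patch tokens and text tokens, plus a soft patch-level contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProbabilisticCrossModalAttention(nn.Module):
    """Bidirectional cross-attention between patch and text tokens; predicts a
    per-patch Gaussian (mean + log-variance) so variance can serve as an
    uncertainty estimate. Dimensions are assumptions for illustration."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mean_head = nn.Linear(dim, dim)
        self.logvar_head = nn.Linear(dim, dim)

    def forward(self, patch_tokens, text_tokens):
        # Bidirectional interaction: patches attend to text, text attends to patches.
        img_ctx, _ = self.img_to_txt(patch_tokens, text_tokens, text_tokens)
        txt_ctx, _ = self.txt_to_img(text_tokens, patch_tokens, patch_tokens)
        # Per-patch Gaussian over fused features; variance acts as uncertainty.
        mu = self.mean_head(img_ctx)
        logvar = self.logvar_head(img_ctx)
        if self.training:
            # Reparameterization trick: sample stochastic features during training.
            feats = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        else:
            feats = mu
        return feats, logvar, txt_ctx


def soft_patch_contrastive_loss(patch_feats, text_feats, soft_targets, tau=0.07):
    """Soft contrastive loss: each patch is pulled toward each text prompt in
    proportion to `soft_targets` (e.g. its overlap with the prompt's mask)."""
    patch_feats = F.normalize(patch_feats, dim=-1)            # (B, P, D)
    text_feats = F.normalize(text_feats, dim=-1)              # (B, T, D)
    logits = patch_feats @ text_feats.transpose(1, 2) / tau   # (B, P, T)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()


if __name__ == "__main__":
    # Toy shapes: 2 images, 196 patches, 4 text prompts, 512-d CLIP features.
    B, P, T, D = 2, 196, 4, 512
    attn = ProbabilisticCrossModalAttention(dim=D)
    patches, texts = torch.randn(B, P, D), torch.randn(B, T, D)
    feats, logvar, _ = attn(patches, texts)
    soft_targets = torch.softmax(torch.randn(B, P, T), dim=-1)
    loss = soft_patch_contrastive_loss(feats, texts, soft_targets)
    print(loss.item(), logvar.shape)
```

In such a setup, the predicted log-variance could be aggregated per patch and upsampled to produce the uncertainty maps the abstract mentions; how MedCLIPSeg actually parameterizes and trains this distribution is detailed in the paper itself.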