
MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation

February 23, 2026
Authors: Taha Koleilat, Hojat Asgariandehkordi, Omid Nejati Manzari, Berardino Barile, Yiming Xiao, Hassan Rivaz
cs.AI

Abstract

Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation.
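To make the mechanism in the abstract more concrete, below is a minimal PyTorch sketch of what bidirectional, uncertainty-aware cross-modal attention over patch-level CLIP embeddings could look like, together with a soft patch-level contrastive loss. This is an illustration under stated assumptions, not the paper's actual implementation: the module names, embedding dimension, single-layer design, Gaussian (mean/log-variance) parameterization of uncertainty, and the form of the soft targets are all assumptions made here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProbabilisticCrossModalAttention(nn.Module):
    """Illustrative sketch (not the paper's architecture): image patch
    tokens and text tokens attend to each other bidirectionally, and the
    patch features are modeled as a Gaussian to expose predictive
    uncertainty."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # One attention block per direction: patches query text, text queries patches.
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Heads predicting a per-patch mean and log-variance (assumed
        # Gaussian parameterization of uncertainty).
        self.mu_head = nn.Linear(dim, dim)
        self.logvar_head = nn.Linear(dim, dim)

    def forward(self, patch_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # patch_tokens: (B, N_patches, dim), e.g. from CLIP's vision encoder
        # text_tokens:  (B, N_text, dim), e.g. from CLIP's text encoder
        patches_attn, _ = self.img_to_txt(patch_tokens, text_tokens, text_tokens)
        text_attn, _ = self.txt_to_img(text_tokens, patch_tokens, patch_tokens)
        mu = self.mu_head(patches_attn)
        logvar = self.logvar_head(patches_attn)
        if self.training:
            # Reparameterization trick: sample patch features during training.
            feats = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        else:
            feats = mu
        # exp(logvar) can be spatially aggregated into an uncertainty map.
        return feats, logvar, text_attn


def soft_patch_contrastive_loss(patch_feats, text_feats, soft_targets, tau=0.07):
    """Illustrative soft patch-level contrastive loss: instead of hard
    one-hot patch-to-prompt assignments, each patch carries a soft target
    distribution over text prompts (how soft_targets are derived is an
    assumption here, e.g. from downsampled segmentation masks).

    patch_feats:  (N, dim)  flattened patch embeddings
    text_feats:   (M, dim)  prompt embeddings
    soft_targets: (N, M)    rows sum to 1
    """
    logits = (F.normalize(patch_feats, dim=-1)
              @ F.normalize(text_feats, dim=-1).T) / tau
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```

Sampling from the Gaussian only during training (and using the mean at inference) is one common design choice for such probabilistic heads; the predicted variance is what would be rendered as the interpretable uncertainty maps mentioned in the abstract.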