MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation

Abstract

Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight the local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation.
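The abstract names three ingredients without giving formulas: patch-level image-text matching, a soft patch-level loss, and per-region uncertainty. The sketch below is a generic illustration of those ideas and is not MedCLIPSeg's actual formulation. Everything in it is an assumption made for illustration: the function names, the sigmoid-over-cosine scoring, the binary cross-entropy against soft targets standing in for the contrastive loss, and the Gaussian noise on the text embedding used to obtain a Monte Carlo variance map.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def patch_probabilities(patch_embs, text_emb, temperature=0.1):
    """Hypothetical scoring: foreground probability of each patch from its
    temperature-scaled cosine similarity to the text embedding."""
    return [sigmoid(cosine(p, text_emb) / temperature) for p in patch_embs]


def soft_patch_loss(probs, soft_targets):
    """Mean binary cross-entropy against soft targets in [0, 1]
    (for example, the fraction of foreground pixels in each patch).
    This is a stand-in, not the paper's contrastive loss."""
    eps = 1e-7
    total = 0.0
    for p, t in zip(probs, soft_targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def mc_uncertainty(patch_embs, text_emb, noise_std=0.1, n_samples=50, seed=0):
    """Monte Carlo estimate: perturb the text embedding with Gaussian noise,
    then return the per-patch mean probability and variance.
    The variance plays the role of a simple uncertainty map."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        noisy_text = [x + rng.gauss(0.0, noise_std) for x in text_emb]
        samples.append(patch_probabilities(patch_embs, noisy_text))
    n = len(patch_embs)
    mean = [sum(s[i] for s in samples) / n_samples for i in range(n)]
    var = [sum((s[i] - mean[i]) ** 2 for s in samples) / n_samples
           for i in range(n)]
    return mean, var
```

In this toy setting, a patch whose embedding is aligned with the text embedding gets a probability near 1 and a very small variance, while a patch orthogonal to the text sits near 0.5 and shows a larger variance. That is the kind of local reliability signal an uncertainty map is meant to expose.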

