

Boosting Medical Visual Understanding From Multi-Granular Language Learning

November 20, 2025
Authors: Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan
cs.AI

Abstract

Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
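To make the abstract's ingredients concrete, below is a minimal sketch of how a multi-granular objective of this kind could be written: a contrastive term supervised by soft multi-label targets, plus a symmetric KL term enforcing consistency between coarse- and fine-grained text granularities. This is not the authors' implementation; the function names, tensor shapes, temperatures, and loss weights are illustrative assumptions. The actual MGLL code is at https://github.com/HUANGLIZI/MGLL.

```python
# Hypothetical sketch (not the official MGLL code): soft-label contrastive loss
# plus a cross-granularity KL consistency term. Shapes, names, and weights are
# assumptions for illustration only.
import torch
import torch.nn.functional as F


def soft_label_contrastive_loss(image_feats, text_feats, soft_targets, temperature=0.07):
    """Cross-entropy between image-text similarities and soft multi-label targets.

    image_feats:  (B, D) L2-normalized image embeddings
    text_feats:   (B, D) L2-normalized text embeddings for one granularity
    soft_targets: (B, B) row-normalized target distribution; images sharing
                  labels receive non-zero (soft) target mass
    """
    logits = image_feats @ text_feats.t() / temperature   # (B, B) similarity logits
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()


def cross_granularity_consistency(logits_coarse, logits_fine, temperature=1.0):
    """Symmetric (smoothed) KL divergence between the similarity distributions
    induced by coarse-grained and fine-grained text descriptions."""
    log_p = F.log_softmax(logits_coarse / temperature, dim=-1)
    log_q = F.log_softmax(logits_fine / temperature, dim=-1)
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)


# Example combined objective (the 0.1 weight is a placeholder, not from the paper):
# loss = soft_label_contrastive_loss(img, txt_coarse, targets_coarse) \
#      + soft_label_contrastive_loss(img, txt_fine, targets_fine) \
#      + 0.1 * cross_granularity_consistency(img @ txt_coarse.t(), img @ txt_fine.t())
```

Because the extra terms only reshape the target distribution and add a divergence between existing similarity matrices, a loss of this form can be dropped into a standard CLIP-style training loop, which is consistent with the paper's description of MGLL as a plug-and-play module.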