Boosting Medical Visual Understanding From Multi-Granular Language Learning

Abstract

Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where an image often corresponds to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. It employs a smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while remaining computationally efficient, and it serves as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods on downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
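
The abstract names three ingredients: soft-label supervision over multiple labels, point-wise constraints, and a smooth KL divergence for cross-granularity consistency. The sketch below illustrates one plausible reading of each term in NumPy. It is not taken from the MGLL repository: the function names, the temperature value, the uniform-mixing form of the smoothing, and the assumption that two granularities produce logits of the same shape are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def similarity_logits(img, txt, temperature=0.07):
    # img: (N, d) image embeddings, txt: (M, d) text embeddings -> (N, M) logits.
    # The temperature 0.07 is an assumed value, not one stated in the abstract.
    return l2_normalize(img) @ l2_normalize(txt).T / temperature


def soft_label_contrastive(img, txt, labels):
    # Assumed form: cross-entropy between softmax(similarity) and the multi-hot
    # label rows normalized to sum to 1. Every image needs >= 1 positive label.
    targets = labels / labels.sum(axis=1, keepdims=True)
    log_probs = log_softmax(similarity_logits(img, txt))
    return float(-(targets * log_probs).sum(axis=1).mean())


def pointwise_bce(img, txt, labels):
    # Assumed form of the point-wise constraint: an independent binary
    # cross-entropy on each (image, text) pair, in its stable logits form.
    x = similarity_logits(img, txt)
    loss = np.maximum(x, 0.0) - x * labels + np.log1p(np.exp(-np.abs(x)))
    return float(loss.mean())


def smooth_kl(logits_a, logits_b, eps=1e-6):
    # Assumed form: KL(p_a || p_b) after mixing both distributions with a
    # uniform distribution, which keeps every probability strictly positive.
    k = logits_a.shape[-1]
    p = (1.0 - eps) * np.exp(log_softmax(logits_a)) + eps / k
    q = (1.0 - eps) * np.exp(log_softmax(logits_b)) + eps / k
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(4, 8))   # 4 images, 8-dim embeddings (toy sizes)
    txt = rng.normal(size=(3, 8))   # 3 label texts at one granularity
    labels = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
    print("soft-label contrastive:", soft_label_contrastive(img, txt, labels))
    print("point-wise BCE:", pointwise_bce(img, txt, labels))
    coarse = rng.normal(size=(4, 3))  # stand-in logits for a second granularity
    print("smooth KL:", smooth_kl(similarity_logits(img, txt), coarse))
```

In a training loop these terms would typically be combined as a weighted sum and added to an existing vision-language objective, which is one way to interpret the plug-and-play claim; the weights and the exact combination used by MGLL are not given in the abstract.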
