On the Compositional Generalization of Multimodal LLMs for Medical Imaging
December 28, 2024
Authors: Zhenyang Cai, Junying Chen, Rongsheng Wang, Weihong Wang, Yonglin Deng, Dingjie Song, Yize Chen, Zixu Zhang, Benyou Wang
cs.AI
Abstract
Multimodal large language models (MLLMs) hold significant potential in the
medical field, but their capabilities are often limited by insufficient data in
certain medical domains, which makes it important to understand what kinds of
images MLLMs can use for generalization. Current research suggests that
multi-task training outperforms single-task training because different tasks
can benefit each other, but these studies often overlook the internal
relationships among the tasks and so provide limited guidance on selecting
datasets to enhance a specific task. To analyze this phenomenon, we attempted
to employ compositional generalization (CG), the ability of models to
understand novel combinations by recombining learned elements, as a guiding
framework. Because medical images can be precisely defined by Modality,
Anatomical area, and Task, they naturally provide an environment for exploring
CG. We therefore assembled 106 medical datasets to create Med-MAT for
comprehensive experiments. The experiments confirmed that MLLMs can use CG to
understand unseen medical images and identified CG as one of the main drivers
of the generalization observed in multi-task training. Further studies
demonstrated that CG effectively supports datasets with limited data and
delivers consistent performance across different backbones, highlighting its
versatility and broad applicability. Med-MAT is publicly available at
https://github.com/FreedomIntelligence/Med-MAT.
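
As an illustration of the CG definition above, the following minimal sketch checks whether a target (Modality, Anatomical area, Task) combination is novel while each of its three elements already appears somewhere in the training data. The `MATLabel` type, the `is_cg_target` helper, and the example labels are hypothetical and chosen only for illustration; they are not taken from Med-MAT, and the rule shown is an assumed reading of the definition rather than the authors' experimental protocol.

```python
from typing import Iterable, NamedTuple


class MATLabel(NamedTuple):
    """Hypothetical tag describing a dataset by its three MAT elements."""
    modality: str
    area: str
    task: str


def is_cg_target(target: MATLabel, train_labels: Iterable[MATLabel]) -> bool:
    """Return True if `target` is an unseen combination of seen elements.

    Assumption: a combination counts as a CG target when the full triple is
    absent from training but its modality, area, and task each occur in at
    least one training triple.
    """
    train = set(train_labels)
    if target in train:
        return False  # the exact combination was already seen
    return (
        any(t.modality == target.modality for t in train)
        and any(t.area == target.area for t in train)
        and any(t.task == target.task for t in train)
    )


# Illustrative training set (labels are made up for this example).
train = [
    MATLabel("X-ray", "Lung", "Disease diagnosis"),
    MATLabel("CT", "Brain", "Disease diagnosis"),
    MATLabel("CT", "Lung", "Lesion grading"),
]

# Novel triple whose elements are all covered by the training set.
print(is_cg_target(MATLabel("CT", "Lung", "Disease diagnosis"), train))
# Triple with a modality that never appears in training.
print(is_cg_target(MATLabel("MRI", "Lung", "Disease diagnosis"), train))
```

In this sketch the first query is a recombination of seen elements, while the second is not because its modality never occurs in the training labels.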