LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models
October 3, 2025
Authors: Ci-Siang Lin, Min-Hung Chen, Yu-Yang Sheng, Yu-Chiang Frank Wang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance on
general visual benchmarks but struggle with out-of-distribution (OOD) tasks in
specialized domains such as medical imaging, where labeled data is limited and
expensive. We introduce LEAML, a label-efficient adaptation framework that
leverages both scarce labeled VQA samples and abundant unlabeled images. Our
approach generates domain-relevant pseudo question-answer pairs for unlabeled
data using a QA generator regularized by caption distillation. Importantly, we
selectively update only those neurons most relevant to question-answering,
enabling the QA generator to efficiently acquire domain-specific knowledge
during distillation. Experiments on gastrointestinal endoscopy and sports VQA
demonstrate that LEAML consistently outperforms standard fine-tuning under
minimal supervision, highlighting the effectiveness of the proposed framework.
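The abstract only sketches the selective-update idea at a high level. As a rough, hypothetical illustration (not the paper's actual method), the PyTorch sketch below scores each output neuron by gradient magnitude on a small labeled VQA batch and freezes all other neurons via gradient masking; the scoring criterion, the `keep_ratio` threshold, and all function names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of selective neuron updating via gradient masking.
# The relevance criterion (gradient magnitude on a labeled batch) is an
# assumption; the paper may use a different selection scheme.

def compute_relevance(model: nn.Module, loss: torch.Tensor) -> dict:
    """Score each output neuron (weight-matrix row) by gradient magnitude."""
    loss.backward()
    scores = {}
    for name, p in model.named_parameters():
        if p.grad is not None and p.dim() == 2:
            # One score per output neuron: sum |grad| over the input dim.
            scores[name] = p.grad.abs().sum(dim=1)
    model.zero_grad()
    return scores

def make_grad_masks(scores: dict, keep_ratio: float = 0.05) -> dict:
    """Keep only the top-scoring fraction of neurons trainable."""
    masks = {}
    for name, s in scores.items():
        k = max(1, int(keep_ratio * s.numel()))
        top = torch.topk(s, k).indices
        mask = torch.zeros_like(s)
        mask[top] = 1.0
        masks[name] = mask.unsqueeze(1)  # broadcast over the input dim
    return masks

def apply_masks(model: nn.Module, masks: dict) -> None:
    """Register hooks that zero the gradients of non-selected neurons."""
    for name, p in model.named_parameters():
        if name in masks:
            p.register_hook(lambda g, m=masks[name]: g * m)

# Usage sketch (model and vqa_loss are placeholders):
#   scores = compute_relevance(model, vqa_loss)
#   masks = make_grad_masks(scores, keep_ratio=0.05)
#   apply_masks(model, masks)
#   ...train as usual; only the selected neurons receive updates.
```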