LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models
October 3, 2025
Authors: Ci-Siang Lin, Min-Hung Chen, Yu-Yang Sheng, Yu-Chiang Frank Wang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance on
general visual benchmarks but struggle with out-of-distribution (OOD) tasks in
specialized domains such as medical imaging, where labeled data is limited and
expensive. We introduce LEAML, a label-efficient adaptation framework that
leverages both scarce labeled VQA samples and abundant unlabeled images. Our
approach generates domain-relevant pseudo question-answer pairs for unlabeled
data using a QA generator regularized by caption distillation. Importantly, we
selectively update only those neurons most relevant to question-answering,
enabling the QA generator to efficiently acquire domain-specific knowledge
during distillation. Experiments on gastrointestinal endoscopy and sports VQA
demonstrate that LEAML consistently outperforms standard fine-tuning under
minimal supervision, highlighting the effectiveness of the proposed framework.
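The abstract only sketches the selective-update idea at a high level. As a rough, hypothetical illustration (not the paper's actual method), the PyTorch sketch below scores each output neuron by gradient magnitude on a small labeled VQA batch and freezes all other neurons via gradient masking; the scoring criterion, the `keep_ratio` threshold, and all function names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of selective neuron updating via gradient masking.
# The relevance criterion (gradient magnitude on a labeled batch) is an
# assumption; the paper may use a different selection scheme.

def compute_relevance(model: nn.Module, loss: torch.Tensor) -> dict:
    """Score each output neuron (weight-matrix row) by gradient magnitude."""
    loss.backward()
    scores = {}
    for name, p in model.named_parameters():
        if p.grad is not None and p.dim() == 2:
            # One score per output neuron: sum |grad| over the input dim.
            scores[name] = p.grad.abs().sum(dim=1)
    model.zero_grad()
    return scores

def make_grad_masks(scores: dict, keep_ratio: float = 0.05) -> dict:
    """Keep only the top-scoring fraction of neurons trainable."""
    masks = {}
    for name, s in scores.items():
        k = max(1, int(keep_ratio * s.numel()))
        top = torch.topk(s, k).indices
        mask = torch.zeros_like(s)
        mask[top] = 1.0
        masks[name] = mask.unsqueeze(1)  # broadcast over the input dim
    return masks

def apply_masks(model: nn.Module, masks: dict) -> None:
    """Register hooks that zero the gradients of non-selected neurons."""
    for name, p in model.named_parameters():
        if name in masks:
            p.register_hook(lambda g, m=masks[name]: g * m)

# Usage sketch (model and vqa_loss are placeholders):
#   scores = compute_relevance(model, vqa_loss)
#   masks = make_grad_masks(scores, keep_ratio=0.05)
#   apply_masks(model, masks)
#   ...train as usual; only the selected neurons receive updates.
```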