LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models
October 3, 2025
Authors: Ci-Siang Lin, Min-Hung Chen, Yu-Yang Sheng, Yu-Chiang Frank Wang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance on
general visual benchmarks but struggle with out-of-distribution (OOD) tasks in
specialized domains such as medical imaging, where labeled data is limited and
expensive. We introduce LEAML, a label-efficient adaptation framework that
leverages both scarce labeled VQA samples and abundant unlabeled images. Our
approach generates domain-relevant pseudo question-answer pairs for unlabeled
data using a QA generator regularized by caption distillation. Importantly, we
selectively update only the neurons most relevant to question answering,
enabling the QA generator to efficiently acquire domain-specific knowledge
during distillation. Experiments on gastrointestinal endoscopy and sports VQA
demonstrate that LEAML consistently outperforms standard fine-tuning under
minimal supervision, highlighting the effectiveness of the proposed framework.
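
The abstract names two concrete mechanisms: caption distillation as a regularizer for the QA generator, and selective updating of QA-relevant neurons. The following is a minimal PyTorch sketch of one way these ideas could fit together. It is a sketch under stated assumptions: the loss functions, the keep_ratio, the gradient-magnitude selection criterion, and the generate_qa method are all hypothetical, since the abstract does not specify the authors' actual implementation.

```python
# A minimal PyTorch sketch of the two mechanisms named in the abstract:
# (1) a QA-supervised loss regularized by caption distillation, and
# (2) updating only the neurons deemed most relevant to question answering.
# Everything here (qa_loss_fn, caption_loss_fn, keep_ratio, generate_qa,
# and the gradient-magnitude selection criterion) is an assumption for
# illustration, not the authors' actual method.

import torch


def select_qa_relevant_neurons(model, labeled_batch, qa_loss_fn, keep_ratio=0.05):
    """Score each parameter element by the magnitude of its QA-loss gradient
    on a small labeled batch; keep the top `keep_ratio` fraction per tensor."""
    model.zero_grad()
    qa_loss_fn(model, labeled_batch).backward()
    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            masks[name] = torch.zeros_like(p, dtype=torch.bool)
            continue
        scores = p.grad.abs()
        k = max(1, int(keep_ratio * scores.numel()))
        threshold = torch.topk(scores.flatten(), k).values.min()
        masks[name] = scores >= threshold
    model.zero_grad()
    return masks


def train_step(model, labeled_batch, unlabeled_batch, teacher_captions,
               qa_loss_fn, caption_loss_fn, masks, optimizer, lam=1.0):
    """One update: QA loss on the scarce labeled VQA samples plus a
    caption-distillation term on unlabeled images. Gradients of neurons
    outside the selected mask are zeroed before the optimizer step."""
    optimizer.zero_grad()
    loss = (qa_loss_fn(model, labeled_batch)
            + lam * caption_loss_fn(model, unlabeled_batch, teacher_captions))
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is not None:
                # Only QA-relevant neurons receive a nonzero gradient.
                p.grad.mul_(masks[name].to(p.grad.dtype))
    optimizer.step()
    return loss.item()


@torch.no_grad()
def pseudo_label(qa_generator, unlabeled_images):
    """Produce pseudo question-answer pairs for unlabeled images; these are
    later mixed with the labeled VQA data to adapt the target MLLM.
    `generate_qa` is a hypothetical per-image generation method."""
    return [qa_generator.generate_qa(image) for image in unlabeled_images]
```

Masking at the gradient level leaves the model and optimizer state untouched, so the neuron-selection criterion can be swapped out without restructuring the training loop.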