LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models
October 3, 2025
Authors: Ci-Siang Lin, Min-Hung Chen, Yu-Yang Sheng, Yu-Chiang Frank Wang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance on
general visual benchmarks but struggle with out-of-distribution (OOD) tasks in
specialized domains such as medical imaging, where labeled data is limited and
expensive. We introduce LEAML, a label-efficient adaptation framework that
leverages both scarce labeled VQA samples and abundant unlabeled images. Our
approach generates domain-relevant pseudo question-answer pairs for unlabeled
data using a QA generator regularized by caption distillation. Importantly, we
selectively update only the neurons most relevant to question answering,
enabling the QA generator to efficiently acquire domain-specific knowledge
during distillation. Experiments on gastrointestinal endoscopy and sports VQA
demonstrate that LEAML consistently outperforms standard fine-tuning under
minimal supervision, highlighting the effectiveness of the proposed framework.
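
The abstract names two concrete mechanisms: caption distillation as a regularizer for the QA generator, and selective updating of QA-relevant neurons. The following is a minimal PyTorch sketch of one way these ideas could fit together. It is a sketch under stated assumptions: the loss functions, the keep_ratio, the gradient-magnitude selection criterion, and the generate_qa method are all hypothetical, since the abstract does not specify the authors' actual implementation.

```python
# A minimal PyTorch sketch of the two mechanisms named in the abstract:
# (1) a QA-supervised loss regularized by caption distillation, and
# (2) updating only the neurons deemed most relevant to question answering.
# Everything here (qa_loss_fn, caption_loss_fn, keep_ratio, generate_qa,
# and the gradient-magnitude selection criterion) is an assumption for
# illustration, not the authors' actual method.

import torch


def select_qa_relevant_neurons(model, labeled_batch, qa_loss_fn, keep_ratio=0.05):
    """Score each parameter element by the magnitude of its QA-loss gradient
    on a small labeled batch; keep the top `keep_ratio` fraction per tensor."""
    model.zero_grad()
    qa_loss_fn(model, labeled_batch).backward()
    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            masks[name] = torch.zeros_like(p, dtype=torch.bool)
            continue
        scores = p.grad.abs()
        k = max(1, int(keep_ratio * scores.numel()))
        threshold = torch.topk(scores.flatten(), k).values.min()
        masks[name] = scores >= threshold
    model.zero_grad()
    return masks


def train_step(model, labeled_batch, unlabeled_batch, teacher_captions,
               qa_loss_fn, caption_loss_fn, masks, optimizer, lam=1.0):
    """One update: QA loss on the scarce labeled VQA samples plus a
    caption-distillation term on unlabeled images. Gradients of neurons
    outside the selected mask are zeroed before the optimizer step."""
    optimizer.zero_grad()
    loss = (qa_loss_fn(model, labeled_batch)
            + lam * caption_loss_fn(model, unlabeled_batch, teacher_captions))
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is not None:
                # Only QA-relevant neurons receive a nonzero gradient.
                p.grad.mul_(masks[name].to(p.grad.dtype))
    optimizer.step()
    return loss.item()


@torch.no_grad()
def pseudo_label(qa_generator, unlabeled_images):
    """Produce pseudo question-answer pairs for unlabeled images; these are
    later mixed with the labeled VQA data to adapt the target MLLM.
    `generate_qa` is a hypothetical per-image generation method."""
    return [qa_generator.generate_qa(image) for image in unlabeled_images]
```

Masking at the gradient level leaves the model and optimizer state untouched, so the neuron-selection criterion can be swapped out without restructuring the training loop.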