LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) achieve strong performance on general visual benchmarks but struggle with out-of-distribution (OOD) tasks in specialized domains such as medical imaging, where labeled data is scarce and expensive to obtain. We introduce LEAML, a label-efficient adaptation framework that leverages both a small set of labeled visual question answering (VQA) samples and abundant unlabeled images. Our approach generates domain-relevant pseudo question-answer pairs for the unlabeled data using a QA Generator regularized by caption distillation. Importantly, we selectively update only those neurons most relevant to question answering, enabling the QA Generator to efficiently acquire domain-specific knowledge during distillation. Experiments on gastrointestinal endoscopy and sports VQA demonstrate that LEAML consistently outperforms standard fine-tuning under minimal supervision.
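The idea of updating only the neurons most relevant to question answering can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how relevance is measured, so the gradient-magnitude score, the top-k selection, and all function and variable names below (`neuron_scores`, `top_k_mask`, `masked_sgd_step`) are assumptions chosen for illustration. Each row of the weight matrix stands for one neuron.

```python
# Illustrative sketch only. The relevance criterion (squared gradient norm
# on labeled QA data) and the top-k rule are assumptions, not taken from
# the abstract.

def neuron_scores(qa_grads):
    """Score each neuron (row) by the sum of its squared gradient entries."""
    return [sum(g * g for g in row) for row in qa_grads]

def top_k_mask(scores, k):
    """Return a boolean mask that selects the k highest-scoring neurons."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(scores))]

def masked_sgd_step(weights, grads, mask, lr):
    """Apply an SGD step to the selected rows; copy all other rows unchanged."""
    return [
        [w - lr * g for w, g in zip(w_row, g_row)] if keep else list(w_row)
        for w_row, g_row, keep in zip(weights, grads, mask)
    ]

# Toy layer with three neurons and two inputs each.
weights = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
# Hypothetical gradients computed on the labeled QA samples.
qa_grads = [[0.1, 0.0], [2.0, 1.0], [0.5, 0.5]]
# Hypothetical gradients of the distillation loss on unlabeled images.
distill_grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

mask = top_k_mask(neuron_scores(qa_grads), k=1)
updated = masked_sgd_step(weights, distill_grads, mask, lr=0.5)
```

In this toy setup only the second neuron has a large QA gradient, so it is the only row that changes during the distillation step. The other rows keep their pretrained values.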
