True Multimodal In-Context Learning Needs Attention to the Visual Context

Abstract

Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL): adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage the visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, which leads to mere text imitation rather than genuine multimodal adaptation. This behavior leaves MICL effectively unimodal and largely restricts its practical utility. More importantly, the limitation is often concealed by improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information, particularly visual content, for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showing substantial improvements in true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm.
