
True Multimodal In-Context Learning Needs Attention to the Visual Context

July 21, 2025
作者: Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu
cs.AI

Abstract

Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL): adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by the improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information, particularly visual content, for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm.
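The abstract does not spell out how DARA rebalances attention, so the following is only a rough sketch of one plausible realization: a learnable bias that shifts pre-softmax attention logits toward tokens flagged as visual context, tuned with very few parameters. The class and variable names (ReallocatedAttention, visual_bias, visual_mask) are hypothetical and are not taken from the paper.

```python
# Illustrative only: single-head attention with a learnable bias that shifts
# probability mass toward tokens marked as visual context. This is NOT the
# paper's implementation of DARA; all names here are made up for the sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReallocatedAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        # Single learnable scalar, initialized to 0 so training starts from
        # vanilla attention; tuning only such scalars keeps fine-tuning cheap.
        self.visual_bias = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, visual_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); visual_mask: (batch, seq) bool, True for tokens
        # that come from images in the in-context demonstrations.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        # Add the learnable bias to logits of keys that are visual tokens,
        # rebalancing attention between visual and textual context.
        logits = logits + self.visual_bias * visual_mask.unsqueeze(1).float()
        attn = F.softmax(logits, dim=-1)
        return self.out(attn @ v)


if __name__ == "__main__":
    layer = ReallocatedAttention(dim=64)
    tokens = torch.randn(2, 12, 64)
    visual_mask = torch.zeros(2, 12, dtype=torch.bool)
    visual_mask[:, :4] = True  # pretend the first 4 tokens are image patches
    print(layer(tokens, visual_mask).shape)  # torch.Size([2, 12, 64])
```

A bias of this kind is one simple way to encode "attend more to the visual context"; the actual DARA parameterization and where it is inserted in the model should be taken from the paper and released code.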