True Multimodal In-Context Learning Needs Attention to the Visual Context

Abstract

Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL): adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite noticeable improvements on standard vision-language datasets, current MLLMs struggle to leverage the visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, which leads to mere text imitation rather than genuine multimodal adaptation. This behavior keeps MICL effectively unimodal and largely restricts its practical utility. More importantly, the limitation is often concealed by improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information, particularly visual content, for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showing substantial improvements in true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm.
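
The phrase "rebalancing attention across visual and textual tokens" can be made concrete with a toy sketch. The abstract does not specify how DARA implements the rebalancing, so everything below is an illustrative assumption rather than the authors' method: the function names (`rebalance_attention`, `visual_mass`), the single multiplicative `visual_factor`, and the choice to rescale weights after the softmax are all hypothetical. In a fine-tuning setting such a factor would presumably be a learned parameter; here it is a fixed number.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rebalance_attention(scores, is_visual, visual_factor):
    """Scale the attention weights on visual tokens, then renormalize.

    Hypothetical illustration only: `visual_factor` stands in for a
    parameter that a fine-tuning method might learn.
    """
    weights = softmax(scores)
    scaled = [w * visual_factor if v else w for w, v in zip(weights, is_visual)]
    total = sum(scaled)
    return [s / total for s in scaled]


def visual_mass(weights, is_visual):
    """Total attention weight that falls on visual tokens."""
    return sum(w for w, v in zip(weights, is_visual) if v)


# Toy sequence: 2 image tokens followed by 3 text tokens.
# The text tokens start with higher raw scores, mimicking text over-reliance.
scores = [0.0, 0.0, 1.0, 1.0, 1.0]
is_visual = [True, True, False, False, False]

before = softmax(scores)
after = rebalance_attention(scores, is_visual, visual_factor=2.0)

print("visual attention before:", round(visual_mass(before, is_visual), 3))
print("visual attention after: ", round(visual_mass(after, is_visual), 3))
```

With a factor above 1, the share of attention on the image tokens grows while the weights still sum to 1. A factor of exactly 1 leaves the original distribution unchanged.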
