

More Images, More Problems? A Controlled Analysis of VLM Failure Modes

January 12, 2026
Authors: Anurag Das, Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Bernt Schiele, Georgios Tzimiropoulos, Brais Martinez
cs.AI

Abstract

Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. While existing benchmarks have initiated the evaluation of multi-image models, a comprehensive analysis of their core weaknesses and their causes is still lacking. In this work, we introduce MIMIC (Multi-Image Model Insights and Challenges), a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs. Using MIMIC, we conduct a series of diagnostic experiments that reveal pervasive issues: LVLMs often fail to aggregate information across images and struggle to track or attend to multiple concepts simultaneously. To address these failures, we propose two novel, complementary remedies. On the data side, we present a procedural data-generation strategy that composes single-image annotations into rich, targeted multi-image training examples. On the optimization side, we analyze layer-wise attention patterns and derive an attention-masking scheme tailored for multi-image inputs. Our experiments show that these remedies substantially improve cross-image aggregation, while also enhancing performance on existing multi-image benchmarks and outperforming the prior state of the art across tasks. Data and code will be made available at https://github.com/anurag-198/MIMIC.
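The abstract only summarizes the procedural data-generation strategy, so the following is a minimal sketch rather than the paper's actual pipeline. It assumes single-image annotations take the form of (image, object-labels) pairs and composes them into a multi-image question with a unique, verifiable answer; the function name, question template, and sampling logic are all illustrative assumptions.

```python
# Hypothetical sketch of procedural multi-image data generation; the paper's
# real templates and annotation sources are not described in the abstract.
import random

def compose_multi_image_example(annotations, num_images=4):
    """Compose single-image annotations into one multi-image training sample.

    `annotations` is assumed to be a list of (image_path, labels) pairs,
    where `labels` is a set of object names present in that image.
    """
    sampled = random.sample(annotations, num_images)
    # Pick a concept that appears in exactly one sampled image, so the
    # question has a unique, verifiable answer.
    for idx, (_, labels) in enumerate(sampled):
        others = set().union(*(l for i, (_, l) in enumerate(sampled) if i != idx))
        unique = labels - others
        if unique:
            concept = random.choice(sorted(unique))
            return {
                "images": [path for path, _ in sampled],
                "question": f"Which image contains a {concept}?",
                "answer": f"Image {idx + 1}",
            }
    return None  # caller resamples if no image has a unique concept
```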
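The attention-masking scheme is likewise not spelled out here. As one plausible instantiation, a block-diagonal mask could confine each image's tokens to attending within their own image while text tokens attend freely; the function name and span encoding below are assumptions, and in practice such a mask would be intersected with the model's causal mask.

```python
# Hypothetical sketch of a multi-image attention mask; the paper derives its
# scheme from layer-wise attention analysis, which this does not reproduce.
import torch

def multi_image_attention_mask(image_spans, seq_len):
    """Return a boolean mask where True means attention is allowed.

    `image_spans` is assumed to be a list of (start, end) token-index
    pairs, one per image; all remaining positions are text tokens.
    """
    mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
    for start, end in image_spans:
        # Block this image's query tokens from attending anywhere...
        mask[start:end, :] = False
        # ...then re-allow attention within the same image.
        mask[start:end, start:end] = True
    return mask

# Example: two images of 8 tokens each, followed by 8 text tokens.
mask = multi_image_attention_mask([(0, 8), (8, 16)], seq_len=24)
```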