

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

January 24, 2024
Authors: Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng
cs.AI

Abstract

Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs' ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping, and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities, as measured by human evaluation, indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide qualitative analysis, offering a robust framework for future advancements in LMM design. https://con-textual.github.io/