ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
January 24, 2024
Authors: Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng
cs.AI
Abstract
Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs' ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping, and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities, as measured by human evaluation, indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide a qualitative analysis, offering a robust framework for future advancements in LMM design.
https://con-textual.github.io/