Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address this gap, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs. With this benchmark, we evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models. The findings reveal that GPT-4o consistently surpasses other models in long-context scenarios, but suffers from hallucination problems in negative samples, i.e., when needles are not in the haystacks. Our comprehensive long-context evaluation of MLLMs also sheds light on the considerable performance gap between API-based and open-source models. All the code, data, and instructions required to reproduce the main results are available at https://github.com/Wang-ML-Lab/multimodal-needle-in-a-haystack.
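The stitching and labeling idea can be illustrated with a minimal sketch. This is not the authors' implementation (that is in the repository linked above). It assumes that sub-images are equal-sized 2D lists of pixel values, that each haystack image is an n x n grid, and that a label is an (image index, row, column) tuple, with `None` marking a negative sample. The function names `stitch_grid`, `build_sample`, `crop`, and `toy_sub_image` are hypothetical.

```python
import random


def toy_sub_image(value, size=2):
    """Hypothetical stand-in for a real picture: a size x size block of one value."""
    return [[value] * size for _ in range(size)]


def stitch_grid(sub_images, n):
    """Stitch n*n equally sized sub-images into one image, in row-major order."""
    assert len(sub_images) == n * n
    height = len(sub_images[0])
    stitched = []
    for r in range(n):
        for y in range(height):
            line = []
            for c in range(n):
                line.extend(sub_images[r * n + c][y])
            stitched.append(line)
    return stitched


def build_sample(pool, m, n, needle=None, rng=None):
    """Build a haystack of m stitched images, each an n x n grid of sub-images.

    pool   : distractor sub-images (assumed not to contain the needle)
    needle : sub-image to hide, or None to build a negative sample
    Returns (haystack, label); label is (image_index, row, col) or None.
    """
    if rng is None:
        rng = random.Random(0)
    cells = n * n
    chosen = rng.sample(pool, m * cells)
    label = None
    if needle is not None:
        # The label is generated automatically from the insertion position.
        pos = rng.randrange(m * cells)
        chosen[pos] = needle
        image_index, cell = divmod(pos, cells)
        row, col = divmod(cell, n)
        label = (image_index, row, col)
    haystack = [stitch_grid(chosen[i * cells:(i + 1) * cells], n) for i in range(m)]
    return haystack, label


def crop(image, row, col, size):
    """Cut the sub-image at grid cell (row, col) back out of a stitched image."""
    return [line[col * size:(col + 1) * size]
            for line in image[row * size:(row + 1) * size]]


if __name__ == "__main__":
    pool = [toy_sub_image(v) for v in range(100)]
    needle = toy_sub_image(-1)
    haystack, label = build_sample(pool, m=2, n=2, needle=needle)
    print("needle label (image, row, col):", label)
    _, negative_label = build_sample(pool, m=2, n=2)
    print("negative sample label:", negative_label)
```

In this sketch a model's answer would be scored by comparing its predicted (image, row, column) with the generated label. For a negative sample the correct answer is that the needle is absent, which is the case where the abstract reports hallucination.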
