
ING-VP: MLLMs cannot Play Easy Vision-based Games Yet

October 9, 2024
Authors: Haoran Zhang, Hangyu Guo, Shuyue Guo, Meng Cao, Wenhao Huang, Jiaheng Liu, Ge Zhang
cs.AI

Abstract

As multimodal large language models (MLLMs) continue to demonstrate increasingly competitive performance across a broad spectrum of tasks, more intricate and comprehensive benchmarks have been developed to assess these cutting-edge models. These benchmarks introduce new challenges to core capabilities such as perception, reasoning, and planning. However, existing multimodal benchmarks fall short in providing a focused evaluation of multi-step planning based on spatial relationships in images. To bridge this gap, we present ING-VP, the first INteractive Game-based Vision Planning benchmark, specifically designed to evaluate the spatial imagination and multi-step reasoning abilities of MLLMs. ING-VP features 6 distinct games, encompassing 300 levels, each with 6 unique configurations. A single model engages in over 60,000 rounds of interaction. The benchmark framework allows for multiple comparison settings, including image-text vs. text-only inputs, single-step vs. multi-step reasoning, and with-history vs. without-history conditions, offering valuable insights into the model's capabilities. We evaluated numerous state-of-the-art MLLMs, with the highest-performing model, Claude-3.5 Sonnet, achieving an average accuracy of only 3.37%, far below the anticipated standard. This work aims to provide a specialized evaluation framework to drive advancements in MLLMs' capacity for complex spatial reasoning and planning. The code is publicly available at https://github.com/Thisisus7/ING-VP.git.
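To make the scale of the evaluation concrete, the grid implied by the abstract (6 games, 300 levels, 6 configurations per level, and the three paired comparison settings) can be sketched as below. This is a hypothetical illustration, not the benchmark's actual code: the setting names and the assumption of a full cross-product of settings are ours, and the paper's actual protocol may evaluate only a subset of combinations.

```python
from itertools import product

# Figures taken from the abstract; names below are illustrative only.
GAMES = 6
LEVELS_TOTAL = 300          # across all 6 games
CONFIGS_PER_LEVEL = 6

# The three comparison axes described in the abstract.
INPUT_MODES = ["image-text", "text-only"]
STEP_MODES = ["single-step", "multi-step"]
HISTORY_MODES = ["with-history", "without-history"]

def count_grid():
    """Return (level-configurations, setting combinations) under a
    full cross-product assumption."""
    level_configs = LEVELS_TOTAL * CONFIGS_PER_LEVEL  # 1800
    settings = list(product(INPUT_MODES, STEP_MODES, HISTORY_MODES))
    return level_configs, len(settings)

level_configs, n_settings = count_grid()
print(level_configs, n_settings)  # 1800 level-configurations, 8 combinations
```

Even before multi-step rollouts are counted, 1,800 level-configurations evaluated under several settings make the reported figure of over 60,000 interaction rounds per model plausible, since each multi-step episode contributes many rounds.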
