

ING-VP: MLLMs cannot Play Easy Vision-based Games Yet

October 9, 2024
Authors: Haoran Zhang, Hangyu Guo, Shuyue Guo, Meng Cao, Wenhao Huang, Jiaheng Liu, Ge Zhang
cs.AI

Abstract
As multimodal large language models (MLLMs) continue to demonstrate increasingly competitive performance across a broad spectrum of tasks, more intricate and comprehensive benchmarks have been developed to assess these cutting-edge models. These benchmarks introduce new challenges to core capabilities such as perception, reasoning, and planning. However, existing multimodal benchmarks fall short in providing a focused evaluation of multi-step planning based on spatial relationships in images. To bridge this gap, we present ING-VP, the first INteractive Game-based Vision Planning benchmark, specifically designed to evaluate the spatial imagination and multi-step reasoning abilities of MLLMs. ING-VP features 6 distinct games, encompassing 300 levels, each with 6 unique configurations. A single model engages in over 60,000 rounds of interaction. The benchmark framework allows for multiple comparison settings, including image-text vs. text-only inputs, single-step vs. multi-step reasoning, and with-history vs. without-history conditions, offering valuable insights into the model's capabilities. We evaluated numerous state-of-the-art MLLMs, with the highest-performing model, Claude-3.5 Sonnet, achieving an average accuracy of only 3.37%, far below the anticipated standard. This work aims to provide a specialized evaluation framework to drive advancements in MLLMs' capacity for complex spatial reasoning and planning. The code is publicly available at https://github.com/Thisisus7/ING-VP.git.
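The abstract describes an interactive protocol: per level, the model is queried for a move, the move is applied to the game state, and the loop repeats, with input modality, step count, and history as toggleable settings. The sketch below illustrates that loop shape on a toy one-dimensional puzzle; all names here (`ToyEnv`, `run_level`, the setting flags) are illustrative assumptions, not the actual ING-VP API.

```python
class ToyEnv:
    """Stand-in for one game level: reach position `goal` on a 1-D track.

    A real ING-VP level would be one of the 6 games (e.g. Sokoban-style
    puzzles); this toy keeps the sketch self-contained.
    """
    def __init__(self, goal=3):
        self.pos, self.goal = 0, goal

    def render_text(self):
        # Text-only observation; an image-text setting would also render a frame.
        return f"pos={self.pos} goal={self.goal}"

    def apply(self, move):
        self.pos += 1 if move == "right" else -1

    def solved(self):
        return self.pos == self.goal


def run_level(model, env, multi_step=True, keep_history=True, max_steps=50):
    """Query the model for one move per round until solved or budget exhausted.

    `multi_step=False` mirrors the single-step setting (one move, then stop);
    `keep_history=False` mirrors the without-history condition (the model sees
    only the current observation). `max_steps` is an assumed per-level budget.
    """
    history = []
    rounds = max_steps if multi_step else 1
    for _ in range(rounds):
        obs = env.render_text()
        move = model(obs, history if keep_history else [])
        env.apply(move)
        history.append(move)
        if env.solved():
            return True
    return False


# A trivial "model" that always moves right solves the toy level in 3 rounds,
# but fails under the single-step setting, where it gets only one move.
always_right = lambda obs, history: "right"
print(run_level(always_right, ToyEnv(goal=3)))                    # True
print(run_level(always_right, ToyEnv(goal=3), multi_step=False))  # False
```

Counting every such round across all levels, configurations, and settings is how a single model accumulates the 60,000+ interactions the benchmark reports.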

