MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning

June 28, 2025
作者: Yulun Jiang, Yekun Chai, Maria Brbić, Michael Moor
cs.AI

Abstract

The ability to process information from multiple modalities and to reason through it step by step remains a critical challenge in advancing artificial intelligence. However, existing reasoning benchmarks focus on text-only reasoning or employ multimodal questions that can be answered by directly retrieving information from a non-text modality. Thus, complex reasoning remains poorly understood in multimodal domains. Here, we present MARBLE, a challenging multimodal reasoning benchmark designed to scrutinize the ability of multimodal language models (MLLMs) to carefully reason step by step through complex multimodal problems and environments. MARBLE is composed of two highly challenging tasks, M-Portal and M-Cube, that require crafting and understanding multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE: all 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube. Only in simplified subtasks do some models outperform the random baseline, indicating that complex reasoning remains a challenge for existing MLLMs. Moreover, we show that perception remains a bottleneck: MLLMs occasionally fail to extract information from the visual inputs. By shedding light on the limitations of MLLMs, we hope that MARBLE will spur the development of the next generation of models with the ability to reason and plan across many multimodal reasoning steps.
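
To make concrete what "near-random performance" means here, below is a minimal, hypothetical Python sketch (not from the paper) of scoring a model against a uniform random-guess baseline on a binary plan-validity style question, as M-Portal-like tasks might pose. The function names, task encoding, and example data are illustrative assumptions, not the paper's evaluation code.

# Hypothetical sketch: comparing an MLLM's accuracy to a random-guess
# baseline on a binary "is this multi-step plan valid?" task. The task
# format is an assumption for illustration; the paper's protocol may differ.
import random

def random_baseline_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing over num_choices options."""
    return 1.0 / num_choices

def evaluate(predictions: list[str], answers: list[str]) -> float:
    """Simple accuracy: fraction of predictions that exactly match the answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Toy example data (made up for illustration).
answers = ["valid", "invalid", "invalid", "valid"]
model_preds = ["valid", "invalid", "valid", "valid"]          # from an MLLM
guess_preds = [random.choice(["valid", "invalid"]) for _ in answers]

acc = evaluate(model_preds, answers)
baseline = random_baseline_accuracy(num_choices=2)
print(f"model accuracy:  {acc:.2f}")   # 0.75 on this toy data
print(f"random baseline: {baseline:.2f}")  # 0.50 for a binary task

"Near-random" then means the model's accuracy is statistically indistinguishable from the baseline value, which for a balanced binary task is 0.5.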