MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning

June 28, 2025
Authors: Yulun Jiang, Yekun Chai, Maria Brbić, Michael Moor
cs.AI

Abstract

The ability to process information from multiple modalities and to reason through it step by step remains a critical challenge in advancing artificial intelligence. However, existing reasoning benchmarks focus on text-only reasoning, or employ multimodal questions that can be answered by directly retrieving information from a non-text modality. Thus, complex reasoning remains poorly understood in multimodal domains. Here, we present MARBLE, a challenging multimodal reasoning benchmark designed to scrutinize the ability of multimodal language models (MLLMs) to reason carefully, step by step, through complex multimodal problems and environments. MARBLE is composed of two highly challenging tasks, M-Portal and M-Cube, that require crafting and understanding multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE: all 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube. Only on simplified subtasks do some models outperform the random baseline, indicating that complex reasoning remains a challenge for existing MLLMs. Moreover, we show that perception remains a bottleneck: MLLMs occasionally fail to extract information from the visual inputs. By shedding light on the limitations of MLLMs, we hope that MARBLE will spur the development of the next generation of models with the ability to reason and plan across many multimodal reasoning steps.
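To make the headline numbers concrete, below is a minimal sketch, in Python, of how a strict evaluation over multistep plans might look. This is not the authors' released code: the `Task` fields, the exact-match criterion, and the `model.plan` interface are all hypothetical illustrations of the kind of all-or-nothing scoring under which a single wrong step fails an entire plan.

```python
# Hypothetical sketch of a MARBLE-style evaluation loop (not the authors' code).
# Assumes each benchmark item pairs a multimodal prompt with a ground-truth
# multistep plan, and an answer counts as correct only if every step matches.

from dataclasses import dataclass


@dataclass
class Task:
    prompt: str            # textual problem description
    images: list[str]      # paths to visual inputs (e.g., level layouts)
    gold_plan: list[str]   # ground-truth sequence of moves/placements


def exact_match(predicted: list[str], gold: list[str]) -> bool:
    """Strict all-steps-correct criterion: one wrong step fails the plan."""
    return predicted == gold


def evaluate(model, tasks: list[Task]) -> float:
    """Return accuracy under the exact-match criterion."""
    correct = 0
    for task in tasks:
        # `model.plan` is a hypothetical interface returning a list of steps.
        predicted = model.plan(task.prompt, task.images)
        correct += exact_match(predicted, task.gold_plan)
    return correct / len(tasks)
```

Under such an all-or-nothing criterion, a model that gets most individual steps right can still score near 0% on full tasks, which is consistent with the gap the abstract reports between the full benchmark and the simplified subtasks.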