LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?

March 25, 2025
Authors: Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan Sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, Kai Chen
cs.AI

Abstract

Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce LEGO-Puzzles, a scalable benchmark designed to evaluate both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. In addition to VQA tasks, we evaluate MLLMs' abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.

March 27, 2025