LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?

March 25, 2025
Authors: Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan Sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, Kai Chen
cs.AI

Abstract

Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce LEGO-Puzzles, a scalable benchmark designed to evaluate both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. In addition to VQA tasks, we evaluate MLLMs' abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.
