LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?

Abstract

Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce LEGO-Puzzles, a scalable benchmark designed to evaluate both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. In addition to VQA tasks, we evaluate MLLMs' abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.
