LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?

Abstract

Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce LEGO-Puzzles, a scalable benchmark designed to evaluate both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. In addition to VQA tasks, we evaluate MLLMs' abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.
