Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Abstract

Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, and they show promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorize recent progress in multimodal large language models (MLLMs), and introduce open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, and visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which enable novel forms of spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updates to this survey, along with code and implementations of the open benchmarks, can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.