Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
October 29, 2025
Authors: Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, Chenfei Liao, Dingcheng Zhen, Yuanhuiyi Lyu, Yuqian Fu, Bin Ren, Linfeng Zhang, Danda Pani Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu
cs.AI
Abstract
Humans possess spatial reasoning abilities that enable them to understand
spaces through multimodal observations, such as vision and sound. Large
multimodal reasoning models extend these abilities by learning to perceive and
reason, showing promising performance across diverse spatial tasks. However,
systematic reviews and publicly available benchmarks for these models remain
limited. In this survey, we provide a comprehensive review of multimodal
spatial reasoning tasks with large models, categorizing recent progress in
multimodal large language models (MLLMs) and introducing open benchmarks for
evaluation. We begin by outlining general spatial reasoning, focusing on
post-training techniques, explainability, and model architectures. Beyond classical 2D
tasks, we examine spatial relationship reasoning, scene and layout
understanding, as well as visual question answering and grounding in 3D space.
We also review advances in embodied AI, including vision-language navigation
and action models. Additionally, we consider emerging modalities such as audio
and egocentric video, which open new avenues for spatial understanding through
novel sensors. We believe this survey establishes a solid foundation and offers
insights into the rapidly growing field of multimodal spatial reasoning. Updated
information about this survey, along with the code and implementations of the
open benchmarks, can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.
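To make the benchmark-evaluation setting concrete, the sketch below scores a multimodal model on spatial VQA items with exact-match accuracy. This is a minimal, hypothetical example: the `BenchmarkItem` schema, the `model` stub, and the metric are illustrative assumptions, not the protocol defined in the repository above.

```python
# Hypothetical sketch of scoring a spatial-reasoning VQA benchmark.
# The item schema, model stub, and exact-match metric are assumptions
# for illustration; consult the linked repository for the actual setup.
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    image_path: str  # visual observation (could also be audio or egocentric video)
    question: str    # spatial question, e.g. "Is the mug left of the laptop?"
    answer: str      # ground-truth short answer


def model(image_path: str, question: str) -> str:
    """Stub for an MLLM call; replace with a real multimodal model."""
    return "left"  # placeholder prediction


def evaluate(items: list[BenchmarkItem]) -> float:
    """Exact-match accuracy over normalized short answers."""
    if not items:
        return 0.0
    correct = sum(
        model(it.image_path, it.question).strip().lower()
        == it.answer.strip().lower()
        for it in items
    )
    return correct / len(items)


if __name__ == "__main__":
    items = [BenchmarkItem("scene.jpg", "Is the mug left of the laptop?", "left")]
    print(f"accuracy = {evaluate(items):.2f}")
```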