Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
March 24, 2026
Authors: Jiacheng Hua, Yishu Yin, Yuhang Wu, Tai Wang, Yifei Huang, Miao Liu
cs.AI
Abstract
Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and consistent improvements over prior prompting strategies across a diverse range of MLLM backbones, spanning different parameter scales and training schemes. We further present ablation studies to validate our design choices, along with detailed analyses that probe the bottlenecks of 3D spatial reasoning in MLLMs.
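To make the abstract's description concrete, the sketch below shows one plausible way a TRACE-style textual scene representation (meta-context, camera trajectory, object entities) could be assembled into a prompt for an MLLM. The function name, field names, and formatting here are illustrative assumptions, not the paper's actual schema.

```python
# A minimal sketch of a TRACE-style textual representation, assembled as a
# prompt. All field names (meta_context, camera_trajectory, objects) and the
# section layout are illustrative assumptions, not the paper's schema.

def build_trace_prompt(meta_context: str,
                       camera_trajectory: list[dict],
                       objects: list[dict],
                       question: str) -> str:
    """Compose a text-based allocentric scene description to serve as an
    intermediate reasoning trace, followed by the spatial question."""
    lines = ["[Meta-context]", meta_context, "", "[Camera trajectory]"]
    for step in camera_trajectory:
        # Each step records an estimated camera pose in scene coordinates.
        lines.append(f"t={step['t']}: position={step['position']}, "
                     f"heading={step['heading']}")
    lines += ["", "[Objects]"]
    for obj in objects:
        # Fine-grained object entities with approximate 3D scene locations.
        lines.append(f"- {obj['name']} at {obj['position']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_trace_prompt(
        meta_context="Indoor apartment, single room, daytime.",
        camera_trajectory=[
            {"t": 0, "position": (0.0, 0.0), "heading": "north"},
            {"t": 5, "position": (2.0, 1.0), "heading": "east"},
        ],
        objects=[
            {"name": "sofa", "position": (1.5, 3.0)},
            {"name": "lamp", "position": (0.5, 2.5)},
        ],
        question="From the doorway, is the lamp to the left of the sofa?",
    )
    print(prompt)
```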