MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
July 16, 2025
Authors: Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan
cs.AI
Abstract
Spatial reasoning in 3D space is central to human cognition and indispensable
for embodied tasks such as navigation and manipulation. However,
state-of-the-art vision-language models (VLMs) frequently struggle with tasks
as simple as anticipating how a scene will look after an egocentric motion:
they perceive 2D images but lack an internal model of 3D dynamics. We therefore
propose MindJourney, a test-time scaling framework that grants a VLM this
missing capability by coupling it to a controllable world model based on video
diffusion. The VLM iteratively sketches a concise camera trajectory, while the
world model synthesizes the corresponding view at each step. The VLM then
reasons over the multi-view evidence gathered during this interactive
exploration. Without any fine-tuning, MindJourney achieves an average
performance boost of over 8% on the representative spatial reasoning benchmark SAT,
showing that pairing VLMs with world models for test-time scaling offers a
simple, plug-and-play route to robust 3D reasoning. Moreover, our method
improves upon test-time inference VLMs trained through reinforcement
learning, demonstrating the potential of utilizing world models for
test-time scaling.
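To make the loop concrete, here is a minimal sketch of the test-time scaling procedure as the abstract describes it: the VLM proposes a camera action, the world model imagines the resulting view, and the VLM finally answers from the accumulated multi-view evidence. All names here (`VLM`, `WorldModel`, `propose_action`, `synthesize_view`, `answer`, `mind_journey`) are hypothetical placeholders for illustration, not the authors' released API.

```python
# A minimal sketch of the MindJourney test-time scaling loop, assuming
# hypothetical interfaces for the VLM and the world model; the authors'
# actual implementation and API may differ.
from typing import List, Protocol


class View:
    """An egocentric frame plus the camera pose it was rendered from."""
    def __init__(self, image: bytes, pose: str):
        self.image = image
        self.pose = pose


class VLM(Protocol):
    def propose_action(self, question: str, views: List[View]) -> str: ...
    def answer(self, question: str, views: List[View]) -> str: ...


class WorldModel(Protocol):
    def synthesize_view(self, view: View, action: str) -> View: ...


def mind_journey(vlm: VLM, world_model: WorldModel, question: str,
                 initial_view: View, max_steps: int = 8) -> str:
    """Explore imagined viewpoints at test time, then answer the question."""
    views = [initial_view]
    for _ in range(max_steps):
        # 1. The VLM sketches the next camera move, e.g. "turn left",
        #    or "stop" once it has gathered enough evidence.
        action = vlm.propose_action(question, views)
        if action == "stop":
            break
        # 2. The video-diffusion world model synthesizes the view that the
        #    proposed egocentric motion would produce.
        views.append(world_model.synthesize_view(views[-1], action))
    # 3. The frozen VLM (no fine-tuning) reasons over all multi-view evidence.
    return vlm.answer(question, views)
```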