MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
July 16, 2025
Authors: Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan
cs.AI
Abstract
Spatial reasoning in 3D space is central to human cognition and indispensable
for embodied tasks such as navigation and manipulation. However,
state-of-the-art vision-language models (VLMs) frequently struggle with tasks
as simple as anticipating how a scene will look after an egocentric motion:
they perceive 2D images but lack an internal model of 3D dynamics. We therefore
propose MindJourney, a test-time scaling framework that grants a VLM this
missing capability by coupling it to a controllable world model based on video
diffusion. The VLM iteratively sketches a concise camera trajectory, while the
world model synthesizes the corresponding view at each step. The VLM then
reasons over the multi-view evidence gathered during this interactive
exploration. Without any fine-tuning, MindJourney achieves an average
performance boost of over 8% on the representative spatial reasoning benchmark SAT,
showing that pairing VLMs with world models for test-time scaling offers a
simple, plug-and-play route to robust 3D reasoning. Moreover, our method
improves upon test-time inference VLMs trained through reinforcement
learning, demonstrating the potential of utilizing world models for
test-time scaling.
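To make the loop concrete, here is a minimal sketch of the test-time scaling procedure as the abstract describes it: the VLM proposes a camera action, the world model imagines the resulting view, and the VLM finally answers from the accumulated multi-view evidence. All names here (`VLM`, `WorldModel`, `propose_action`, `synthesize_view`, `answer`, `mind_journey`) are hypothetical placeholders for illustration, not the authors' released API.

```python
# A minimal sketch of the MindJourney test-time scaling loop, assuming
# hypothetical interfaces for the VLM and the world model; the authors'
# actual implementation and API may differ.
from typing import List, Protocol


class View:
    """An egocentric frame plus the camera pose it was rendered from."""
    def __init__(self, image: bytes, pose: str):
        self.image = image
        self.pose = pose


class VLM(Protocol):
    def propose_action(self, question: str, views: List[View]) -> str: ...
    def answer(self, question: str, views: List[View]) -> str: ...


class WorldModel(Protocol):
    def synthesize_view(self, view: View, action: str) -> View: ...


def mind_journey(vlm: VLM, world_model: WorldModel, question: str,
                 initial_view: View, max_steps: int = 8) -> str:
    """Explore imagined viewpoints at test time, then answer the question."""
    views = [initial_view]
    for _ in range(max_steps):
        # 1. The VLM sketches the next camera move, e.g. "turn left",
        #    or "stop" once it has gathered enough evidence.
        action = vlm.propose_action(question, views)
        if action == "stop":
            break
        # 2. The video-diffusion world model synthesizes the view that the
        #    proposed egocentric motion would produce.
        views.append(world_model.synthesize_view(views[-1], action))
    # 3. The frozen VLM (no fine-tuning) reasons over all multi-view evidence.
    return vlm.answer(question, views)
```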