MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
July 16, 2025
Authors: Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan
cs.AI
Abstract
Spatial reasoning in 3D space is central to human cognition and indispensable
for embodied tasks such as navigation and manipulation. However,
state-of-the-art vision-language models (VLMs) frequently struggle with tasks
as simple as anticipating how a scene will look after an egocentric motion:
they perceive 2D images but lack an internal model of 3D dynamics. We therefore
propose MindJourney, a test-time scaling framework that equips a VLM with this
missing capability by coupling it to a controllable world model based on video
diffusion. The VLM iteratively sketches a concise camera trajectory, while the
world model synthesizes the corresponding view at each step. The VLM then
reasons over the multi-view evidence gathered during this interactive
exploration. Without any fine-tuning, MindJourney achieves an average
performance boost of over 8% on the representative spatial reasoning benchmark
SAT, showing that pairing VLMs with world models for test-time scaling offers a
simple, plug-and-play route to robust 3D reasoning. Our method also improves
over test-time reasoning VLMs trained with reinforcement learning, further
demonstrating the potential of world models for test-time scaling.
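The propose-render-reason loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the class names, method names (`propose_action`, `render`, `answer`), and stub behaviors are all hypothetical stand-ins for the VLM and the video-diffusion world model.

```python
from dataclasses import dataclass, field

@dataclass
class StubWorldModel:
    """Stand-in for the controllable world model: given the current view
    and a camera action, it synthesizes the next (imagined) view."""
    def render(self, view: str, action: str) -> str:
        return f"{view}->{action}"  # placeholder for video-diffusion synthesis

@dataclass
class StubVLM:
    """Stand-in for the VLM: it proposes camera moves one step at a time,
    then answers the spatial question from the accumulated views."""
    plan: list = field(default_factory=lambda: ["forward", "turn_left"])

    def propose_action(self, views, question):
        step = len(views) - 1
        # Return None when the VLM judges exploration to be sufficient.
        return self.plan[step] if step < len(self.plan) else None

    def answer(self, views, question):
        return f"answer based on {len(views)} views"

def mindjourney(vlm, world_model, initial_view, question, max_steps=4):
    """Test-time scaling loop: iteratively sketch a camera trajectory,
    imagine each resulting view, then reason over all gathered views."""
    views = [initial_view]
    for _ in range(max_steps):
        action = vlm.propose_action(views, question)  # next camera move
        if action is None:
            break
        views.append(world_model.render(views[-1], action))  # imagined view
    return vlm.answer(views, question)  # reason over multi-view evidence

result = mindjourney(StubVLM(), StubWorldModel(), "ego_view_0",
                     "Is the chair left of the table?")
print(result)  # the stub VLM stops after its two planned moves
```

Note that the loop itself requires no fine-tuning of either component, which is the plug-and-play property the abstract emphasizes: any VLM that can propose camera actions can be paired with any controllable view synthesizer.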