MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
July 16, 2025
Authors: Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan
cs.AI
Abstract
Spatial reasoning in 3D space is central to human cognition and indispensable
for embodied tasks such as navigation and manipulation. However,
state-of-the-art vision-language models (VLMs) frequently struggle with tasks
as simple as anticipating how a scene will look after an egocentric motion:
they perceive 2D images but lack an internal model of 3D dynamics. We therefore
propose MindJourney, a test-time scaling framework that equips a VLM with this
missing capability by coupling it to a controllable world model based on video
diffusion. The VLM iteratively sketches a concise camera trajectory, while the
world model synthesizes the corresponding view at each step. The VLM then
reasons over the multi-view evidence gathered during this interactive
exploration. Without any fine-tuning, MindJourney achieves an average
performance boost of over 8% on the representative spatial reasoning benchmark
SAT, showing that pairing VLMs with world models for test-time scaling offers a
simple, plug-and-play route to robust 3D reasoning. Our method also improves
over test-time reasoning VLMs trained with reinforcement learning, further
demonstrating the potential of world models for test-time scaling.
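The propose-render-reason loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the class names, method names (`propose_action`, `render`, `answer`), and stub behaviors are all hypothetical stand-ins for the VLM and the video-diffusion world model.

```python
from dataclasses import dataclass, field

@dataclass
class StubWorldModel:
    """Stand-in for the controllable world model: given the current view
    and a camera action, it synthesizes the next (imagined) view."""
    def render(self, view: str, action: str) -> str:
        return f"{view}->{action}"  # placeholder for video-diffusion synthesis

@dataclass
class StubVLM:
    """Stand-in for the VLM: it proposes camera moves one step at a time,
    then answers the spatial question from the accumulated views."""
    plan: list = field(default_factory=lambda: ["forward", "turn_left"])

    def propose_action(self, views, question):
        step = len(views) - 1
        # Return None when the VLM judges exploration to be sufficient.
        return self.plan[step] if step < len(self.plan) else None

    def answer(self, views, question):
        return f"answer based on {len(views)} views"

def mindjourney(vlm, world_model, initial_view, question, max_steps=4):
    """Test-time scaling loop: iteratively sketch a camera trajectory,
    imagine each resulting view, then reason over all gathered views."""
    views = [initial_view]
    for _ in range(max_steps):
        action = vlm.propose_action(views, question)  # next camera move
        if action is None:
            break
        views.append(world_model.render(views[-1], action))  # imagined view
    return vlm.answer(views, question)  # reason over multi-view evidence

result = mindjourney(StubVLM(), StubWorldModel(), "ego_view_0",
                     "Is the chair left of the table?")
print(result)  # the stub VLM stops after its two planned moves
```

Note that the loop itself requires no fine-tuning of either component, which is the plug-and-play property the abstract emphasizes: any VLM that can propose camera actions can be paired with any controllable view synthesizer.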