MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

Abstract

Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that gives a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over the multi-view evidence gathered during this interactive exploration. Without any fine-tuning, MindJourney achieves an average performance boost of over 8% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Our method also improves on test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of using world models for test-time scaling.
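The explore-then-reason loop described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the action vocabulary `ACTIONS`, the beam-style pruning, the `explore` function, and the string-based stand-ins for images, the world model, and the VLM scorer are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical camera actions; the abstract does not list the real action set.
ACTIONS = ("forward", "turn_left", "turn_right")

View = str  # stand-in for an image; a real system would pass image tensors


@dataclass
class Candidate:
    trajectory: Tuple[str, ...]  # camera actions taken from the start view
    view: View                   # view imagined at the end of the trajectory
    score: float                 # how helpful the scorer judges this view


def explore(
    start_view: View,
    question: str,
    world_model: Callable[[View, str], View],
    score_view: Callable[[View, str], float],
    steps: int = 2,
    beam_width: int = 2,
) -> List[Candidate]:
    """Collect multi-view evidence by extending short camera trajectories.

    Beam-style pruning is an assumed search strategy, chosen only to keep
    the example small.
    """
    beam = [Candidate((), start_view, score_view(start_view, question))]
    evidence = list(beam)
    for _ in range(steps):
        expanded = []
        for cand in beam:
            for action in ACTIONS:
                # The world model synthesizes the view after one more action.
                view = world_model(cand.view, action)
                expanded.append(
                    Candidate(cand.trajectory + (action,), view,
                              score_view(view, question))
                )
        # Keep only the views the scorer finds most helpful.
        beam = sorted(expanded, key=lambda c: c.score, reverse=True)[:beam_width]
        evidence.extend(beam)
    return evidence


# Toy stand-ins so the sketch runs without any model.
def toy_world_model(view: View, action: str) -> View:
    return f"{view}>{action}"


def toy_score(view: View, question: str) -> float:
    # Pretend that views reached by turning left are the helpful ones.
    return float(view.count("turn_left"))


evidence = explore("img0", "What is to the left of the chair?",
                   toy_world_model, toy_score)
best = max(evidence, key=lambda c: c.score)
```

In a real pipeline, the collected `evidence` views would be passed back to the VLM together with the question, and `toy_world_model` and `toy_score` would be replaced by calls to a video-diffusion world model and to the VLM itself.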
