MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

Abstract

Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over the multi-view evidence gathered during this interactive exploration. Without any fine-tuning, MindJourney achieves an average performance boost of over 8% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Our method also improves upon test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of using world models for test-time scaling.
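The interaction described in the abstract (the VLM proposes camera motion, the world model renders the resulting view, and the VLM answers from the collected views) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than the paper's implementation: the `ACTIONS` set, the keep-top-k selection rule, and the `render`, `score`, and `answer` callables, which are toy placeholders for a video-diffusion world model and a VLM.

```python
from typing import Callable, List, Tuple

Trajectory = Tuple[str, ...]  # a sequence of egocentric camera actions
ACTIONS = ("forward", "turn_left", "turn_right")  # hypothetical action set


def explore_and_answer(
    image: object,
    question: str,
    render: Callable[[object, Trajectory], object],  # stand-in for the world model
    score: Callable[[str, object], float],           # stand-in for VLM view ranking
    answer: Callable[[str, List[object]], str],      # stand-in for VLM answering
    steps: int = 2,
    keep: int = 2,
) -> str:
    """Extend camera trajectories step by step, render views, then answer."""
    frontier: List[Trajectory] = [()]
    evidence: List[object] = [image]  # the original view is always included
    for _ in range(steps):
        # Extend every kept trajectory by one action and render the new view.
        candidates = [
            (t + (a,), render(image, t + (a,)))
            for t in frontier
            for a in ACTIONS
        ]
        # Rank imagined views by how helpful the scorer judges them (assumed rule).
        candidates.sort(key=lambda c: score(question, c[1]), reverse=True)
        best = candidates[:keep]
        frontier = [t for t, _ in best]
        evidence.extend(view for _, view in best)
    return answer(question, evidence)


# Toy stand-ins so the sketch runs without any model.
def toy_render(image: object, trajectory: Trajectory) -> str:
    return f"{image}|{'>'.join(trajectory)}"


def toy_score(question: str, view: object) -> float:
    return float(str(view).count("turn_left"))


def toy_answer(question: str, views: List[object]) -> str:
    return f"{len(views)} views considered"


if __name__ == "__main__":
    print(explore_and_answer("frame0", "What is left of the sofa?",
                             toy_render, toy_score, toy_answer))
```

With the default `steps=2` and `keep=2`, the toy run gathers the original frame plus two imagined views per step. In a real setting the three callables would wrap an actual world model and VLM, and the selection rule would need to match whatever the method specifies.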
