Thinking with Imagination: Agentic Visual-Spatial Reasoning with a World Simulator

Abstract

Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, but their spatial reasoning remains largely constrained to the observed images and text-oriented chain-of-thought. When only limited egocentric observations are available, they often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints. In this work, we frame this problem as thinking with imagination, in which a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that equips VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, a VLM policy trained with reinforcement learning (RL), with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning, which improves pose and content consistency across views. In the RL stage, we propose a two-phase RL curriculum with the world simulator in the loop, which stabilizes tool-use exploration and strengthens the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments show that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results indicate that imagined observations can provide useful spatial evidence, but that effective world-model-augmented reasoning requires learning when, where, and how to imagine.
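
The interaction described in the abstract, in which a policy either answers directly or asks a world simulator for a new view via a natural-language camera motion, might be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the Astra implementation: the names `Step`, `reason_with_imagination`, `toy_policy`, `toy_simulator`, and the `max_calls` budget are all hypothetical, images are stood in for by plain strings, and the toy policy and simulator are stubs rather than Astra-VL or Astra-WM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One policy decision: answer now, or request an imagined view (hypothetical type)."""
    answer: Optional[str] = None
    camera_motion: Optional[str] = None  # natural-language motion, e.g. "turn left 90 degrees"


# Hypothetical interfaces: images are represented as plain strings in this sketch.
Policy = Callable[[str, List[str], bool], Step]   # (question, views, must_answer) -> Step
Simulator = Callable[[List[str], str], str]       # (context views, camera motion) -> new view


def reason_with_imagination(
    question: str,
    context_images: List[str],
    policy: Policy,
    simulator: Simulator,
    max_calls: int = 3,  # assumed budget on simulator calls; not taken from the source
) -> Tuple[str, List[str]]:
    """Alternate between the policy and the simulator until the policy answers."""
    views = list(context_images)
    for used in range(max_calls + 1):
        must_answer = used == max_calls  # budget exhausted: the policy has to answer
        step = policy(question, views, must_answer)
        if step.answer is not None:
            return step.answer, views
        if must_answer or step.camera_motion is None:
            raise ValueError("policy must answer or request a camera motion")
        # Imagined evidence is appended to the context for the next policy turn.
        views.append(simulator(views, step.camera_motion))
    raise AssertionError("unreachable: the last iteration returns or raises")


def toy_policy(question: str, views: List[str], must_answer: bool) -> Step:
    """Stub policy: imagine one extra view if only one is available, then answer."""
    if len(views) < 2 and not must_answer:
        return Step(camera_motion="turn left 90 degrees")
    return Step(answer=f"answered using {len(views)} views")


def toy_simulator(views: List[str], camera_motion: str) -> str:
    """Stub simulator: returns a label instead of generating an image."""
    return f"imagined({camera_motion})"


if __name__ == "__main__":
    print(reason_with_imagination("What is behind the sofa?", ["img0"], toy_policy, toy_simulator))
```

The stub policy decides by a fixed rule when to imagine; in the setting the abstract describes, that decision is what the RL-trained policy is meant to learn.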