Thinking with Imagination: Agentic Visual-Spatial Reasoning with World Simulators

Abstract

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning remains largely confined to the observed images and text-based chain-of-thought. When only limited egocentric observations are available, they often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints. In this work, we frame this problem as thinking with imagination, in which a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that equips VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning, which improves pose and content consistency across views. For the RL stage, we propose a two-phase curriculum with the world simulator in the loop, which stabilizes tool-use exploration and trains the model to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.
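
The interaction pattern described above, where a policy either answers directly or requests an imagined view from a simulator given a context image and a natural-language camera motion, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the Astra implementation: the names `Imagine`, `Answer`, `reason_with_imagination`, `toy_policy`, and `toy_simulator` are hypothetical, images are stood in for by strings, and the call budget `max_calls` is an invented parameter.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

Image = str  # stand-in for real image data in this sketch


@dataclass
class Imagine:
    """Hypothetical action: request a novel view from the simulator."""
    context_index: int   # which available view to start from
    camera_motion: str   # natural-language camera motion


@dataclass
class Answer:
    """Hypothetical action: stop and answer the question."""
    text: str


Action = Union[Imagine, Answer]
Policy = Callable[[str, List[Image], bool], Action]
Simulator = Callable[[Image, str], Image]


def reason_with_imagination(
    question: str,
    observations: List[Image],
    policy: Policy,
    simulator: Simulator,
    max_calls: int = 3,
) -> Tuple[str, List[Image]]:
    """Run the policy, letting it call the simulator up to max_calls times."""
    views = list(observations)
    for _ in range(max_calls):
        action = policy(question, views, False)
        if isinstance(action, Answer):
            # The policy judged that no (further) imagination is needed.
            return action.text, views
        # Append the imagined view so later steps can condition on it.
        views.append(simulator(views[action.context_index], action.camera_motion))
    # Budget spent: ask the policy for a final answer.
    final = policy(question, views, True)
    if not isinstance(final, Answer):
        raise ValueError("policy must answer once the imagination budget is spent")
    return final.text, views


def toy_policy(question: str, views: List[Image], must_answer: bool) -> Action:
    """Toy stand-in for a VLM policy: imagine once, then answer."""
    if must_answer or len(views) >= 2:
        return Answer(f"answer from {len(views)} views")
    return Imagine(context_index=0, camera_motion="turn left 90 degrees")


def toy_simulator(image: Image, camera_motion: str) -> Image:
    """Toy stand-in for a world simulator: returns a labeled placeholder."""
    return f"imagined({image}; {camera_motion})"
```

Calling `reason_with_imagination("Where is the door?", ["img0"], toy_policy, toy_simulator)` runs one imagination step and then answers. In this sketch the decision of when to imagine is hard-coded in `toy_policy`, whereas the abstract describes that decision as learned through reinforcement learning.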