Thinking with Imagination: Embodied Visual-Spatial Reasoning with World Simulators

Abstract

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning remains largely confined to the observed images and to text-oriented chain-of-thought. When only limited egocentric observations are available, they often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints. In this work, we frame this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that equips VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum that stabilizes tool-use exploration and improves the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.
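
The interaction pattern described in the abstract, in which a policy either answers directly or asks a world simulator for a new view given a natural-language camera motion, can be illustrated with a minimal sketch. This is not the Astra implementation: every name here (`Imagine`, `Answer`, `think_with_imagination`, `toy_policy`, `toy_simulator`, and the `max_calls` budget) is a hypothetical stand-in, images are represented as plain strings, and the toy policy is a hand-written rule rather than an RL-trained VLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

Image = str  # assumption: a real system would pass image arrays or file paths


@dataclass
class Imagine:
    """Hypothetical policy action: request a novel view from the simulator."""
    camera_motion: str  # natural-language camera motion


@dataclass
class Answer:
    """Hypothetical policy action: stop reasoning and answer."""
    text: str


Action = Union[Imagine, Answer]
# (question, views so far, whether imagining is still allowed) -> action
Policy = Callable[[str, List[Image], bool], Action]
# (context views, camera motion) -> imagined view
Simulator = Callable[[List[Image], str], Image]


def think_with_imagination(
    policy: Policy,
    simulator: Simulator,
    question: str,
    observed: List[Image],
    max_calls: int = 3,
) -> Tuple[str, int]:
    """Run the policy until it answers; return (answer, simulator calls used)."""
    views = list(observed)  # copy so the caller's list is not mutated
    calls = 0
    while True:
        action = policy(question, views, calls < max_calls)
        if isinstance(action, Answer):
            return action.text, calls
        if calls >= max_calls:
            raise ValueError("policy requested imagination after the budget was spent")
        # Imagined views are appended as extra evidence for the next step.
        views.append(simulator(views, action.camera_motion))
        calls += 1


def toy_policy(question: str, views: List[Image], can_imagine: bool) -> Action:
    """Stand-in rule: imagine once if only a single view is available."""
    if len(views) < 2 and can_imagine:
        return Imagine("turn left 90 degrees")
    return Answer(f"answered using {len(views)} view(s)")


def toy_simulator(context: List[Image], camera_motion: str) -> Image:
    """Stand-in simulator: returns a label instead of a generated image."""
    return f"imagined({camera_motion})"


if __name__ == "__main__":
    print(think_with_imagination(toy_policy, toy_simulator, "What is behind the sofa?", ["view_0"]))
```

In this sketch the decision of when to imagine lives entirely in the policy, which mirrors the abstract's point that the simulator alone is not sufficient: the policy must learn when a simulator call is worth making and which camera motion to request.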