Thinking with Imagination: Agentic Visual Spatial Reasoning with a World Simulator

Abstract

Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, but their spatial reasoning remains largely constrained to the observed images and text-oriented chain-of-thought. When only limited egocentric observations are available, they often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints. In this work, we frame this problem as "thinking with imagination": a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that equips VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, a VLM policy trained with reinforcement learning (RL), with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning, which improves pose and content consistency across views. In the RL stage, we propose a two-phase RL curriculum with the world simulator in the loop; it stabilizes tool-use exploration and improves the model's ability to invoke the simulator only when imagined observations help more than answering directly. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but that effective world-model-augmented reasoning requires learning when, where, and how to imagine.
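The interaction pattern described in the abstract, a policy that either answers directly or requests an imagined view from a simulator via a natural-language camera motion, can be illustrated with a minimal sketch. This is not the Astra implementation: every name here (`Imagine`, `Answer`, `reason_with_imagination`, `toy_policy`, `toy_simulator`, `max_calls`) is hypothetical, images are stood in for by plain strings, and the policy and simulator are toy stubs rather than Astra-VL or Astra-WM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class Imagine:
    """Hypothetical policy decision: request a new view from the simulator."""
    view_index: int     # which piece of existing evidence to start from
    camera_motion: str  # natural-language camera motion


@dataclass
class Answer:
    """Hypothetical policy decision: stop and answer the question."""
    text: str


Decision = Union[Imagine, Answer]
Policy = Callable[[str, List[str], int], Decision]  # (question, evidence, calls_left)
Simulator = Callable[[str, str], str]               # (image, camera_motion) -> image


def reason_with_imagination(
    question: str,
    views: List[str],
    policy: Policy,
    simulator: Simulator,
    max_calls: int = 2,
) -> Tuple[str, List[str]]:
    """Alternate between policy decisions and simulator calls until an answer."""
    evidence = list(views)  # observed views plus any imagined ones
    calls_left = max_calls
    while True:
        decision = policy(question, evidence, calls_left)
        if isinstance(decision, Answer):
            return decision.text, evidence
        if calls_left == 0:
            raise RuntimeError("policy requested imagination with no budget left")
        imagined = simulator(evidence[decision.view_index], decision.camera_motion)
        evidence.append(imagined)
        calls_left -= 1


def toy_policy(question: str, evidence: List[str], calls_left: int) -> Decision:
    # Toy rule (assumption): imagine once if only one view exists, then answer.
    if len(evidence) == 1 and calls_left > 0:
        return Imagine(view_index=0, camera_motion="turn left 90 degrees")
    return Answer(f"answered using {len(evidence)} view(s)")


def toy_simulator(image: str, camera_motion: str) -> str:
    # Stand-in for a generative world simulator: returns a labeled placeholder.
    return f"imagined({image}, {camera_motion})"


answer, evidence = reason_with_imagination(
    "What is behind the sofa?", ["view_0"], toy_policy, toy_simulator
)
```

In this sketch the decision of whether to imagine lives entirely in the policy, which mirrors the abstract's point that the model must learn when, where, and how to imagine; the call budget is an added safeguard for the example, not something the abstract specifies.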