4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Abstract

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet it remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatio-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules, which increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., by internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) tackles more complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective on 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
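The two training objectives named in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss forms, so everything below is an assumption made for illustration: a mean-squared-error term for the latent supervision in DIFT, a REINFORCE-style surrogate for 4DRL, and the hypothetical names `joint_sft_loss`, `text_only_policy_loss`, and `latent_weight`. It is not the released implementation.

```python
from typing import Sequence


def joint_sft_loss(
    text_nll: Sequence[float],
    pred_latents: Sequence[Sequence[float]],
    target_latents: Sequence[Sequence[float]],
    latent_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Joint supervision sketch: mean text NLL plus weighted latent MSE.

    Assumption: the latent term is a mean squared error; the abstract only
    says that text tokens and 4D latents are supervised jointly.
    """
    text_term = sum(text_nll) / len(text_nll)
    sq_errors = [
        (p - t) ** 2
        for pred, tgt in zip(pred_latents, target_latents)
        for p, t in zip(pred, tgt)
    ]
    latent_term = sum(sq_errors) / len(sq_errors)
    return text_term + latent_weight * latent_term


def text_only_policy_loss(
    log_probs: Sequence[float],
    is_text: Sequence[bool],
    advantage: float,
) -> float:
    """Policy-gradient surrogate averaged over text tokens only.

    Latent positions are excluded from the loss, so under automatic
    differentiation they would receive no policy gradient.
    Assumption: a plain REINFORCE-style surrogate with a scalar,
    outcome-based advantage for the whole sequence.
    """
    if len(log_probs) != len(is_text):
        raise ValueError("log_probs and is_text must have the same length")
    kept = [lp for lp, text in zip(log_probs, is_text) if text]
    if not kept:
        return 0.0
    return -advantage * sum(kept) / len(kept)


if __name__ == "__main__":
    # Positions 1 and 2 stand in for latent "imagery" tokens.
    mask = [True, False, False, True]
    print(text_only_policy_loss([-0.5, -2.0, -1.5, -1.0], mask, advantage=2.0))
    print(joint_sft_loss([1.0, 3.0], [[0.0, 1.0]], [[1.0, 1.0]], latent_weight=0.5))
```

Because the latent positions are masked out of the surrogate, changing their log-probabilities leaves the policy loss unchanged. This is one simple way to realize the restriction of policy gradients to text tokens described above.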