
4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

May 7, 2026
作者: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang
cs.AI

Abstract

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines, offering a new perspective on 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
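The two training stages described above (DIFT's joint supervision of text tokens and 4D latents, and 4DRL's policy gradients restricted to text tokens) suggest a simple loss structure. The following is a minimal PyTorch sketch for illustration only, not the paper's implementation: the tensor shapes, the use of MSE for the latent supervision, and the masking scheme are all assumptions on my part; see https://github.com/zhangquanchen/4DThinker for the actual code.

```python
# Illustrative sketch of the two training signals described in the abstract.
# Shapes, loss choices, and masking are assumptions, not the released implementation.
import torch
import torch.nn.functional as F

def dift_loss(text_logits, text_targets, pred_4d_latents, target_4d_latents, latent_mask):
    """DIFT-style joint supervision: cross-entropy on text tokens plus a regression
    term grounding predicted 4D latents in target dynamic-scene features.
    text_logits: [B, T, V], text_targets: [B, T] (ignore_index=-100 at latent positions),
    *_4d_latents: [B, T, D], latent_mask: [B, T] with 1 at 4D-latent positions."""
    ce = F.cross_entropy(text_logits.transpose(1, 2), text_targets, ignore_index=-100)
    per_pos = F.mse_loss(pred_4d_latents, target_4d_latents, reduction="none").mean(dim=-1)
    latent = (per_pos * latent_mask).sum() / latent_mask.sum().clamp(min=1)
    return ce + latent

def text_only_policy_gradient(logprobs, advantages, text_token_mask):
    """4DRL-style outcome-based update restricted to text tokens: positions holding
    4D latents are masked out so their gradients do not destabilize optimization.
    logprobs: [B, T], advantages: [B] (from an outcome reward), text_token_mask: [B, T]."""
    per_token = -(logprobs * advantages.unsqueeze(-1)) * text_token_mask
    return per_token.sum() / text_token_mask.sum().clamp(min=1)
```

In this reading, the abstract's claim that restricting policy gradients to text tokens stabilizes optimization corresponds to zeroing out the latent positions in the RL loss while the continuous 4D latents are shaped only by the supervised DIFT stage.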