4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
May 7, 2026
Authors: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang
cs.AI
Abstract
Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
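The abstract describes Dynamic-Imagery Fine-Tuning (DIFT) as jointly supervising textual tokens and 4D latent representations. A minimal sketch of what such a joint objective could look like, assuming a toy setup where the model emits token logits at text positions and continuous vectors at latent positions: cross-entropy on the text tokens plus a regression term on the latents. The function and parameter names (`dift_loss`, `latent_weight`) are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Softmax cross-entropy averaged over text-token positions."""
    z = logits - logits.max(axis=-1, keepdims=True)       # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def dift_loss(text_logits, text_targets, pred_latents, target_latents,
              latent_weight=1.0):
    """Joint supervision: CE on text tokens + MSE regression on 4D latents.

    Restricting policy gradients to text tokens (as in the 4DRL stage)
    would correspond to backpropagating a reward-weighted signal only
    through the `text_logits` term, leaving the latent branch untouched.
    """
    ce = cross_entropy(text_logits, text_targets)
    reg = np.mean((pred_latents - target_latents) ** 2)   # latent regression
    return ce + latent_weight * reg

# Toy example: 5 text tokens over a vocab of 10, 3 latent tokens of dim 16.
rng = np.random.default_rng(0)
loss = dift_loss(
    rng.normal(size=(5, 10)), rng.integers(0, 10, size=5),
    rng.normal(size=(3, 16)), rng.normal(size=(3, 16)),
)
print(loss)
```

The weighting between the two terms, and whether the latent loss is MSE or something else, are assumptions here; the paper only states that both modalities are supervised jointly.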