4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Abstract

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet it remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatio-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective on 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
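
The idea of restricting policy gradients to text tokens can be illustrated with a small sketch. The abstract gives no formula, so everything below is an assumption made for illustration: the REINFORCE-style surrogate, the function name masked_policy_gradient_loss, the boolean mask is_text, and the scalar advantage are all hypothetical and are not taken from the 4DThinker code.

```python
from typing import Sequence


def masked_policy_gradient_loss(
    log_probs: Sequence[float],
    is_text: Sequence[bool],
    advantage: float,
) -> float:
    """REINFORCE-style surrogate loss averaged over text-token positions only.

    Hypothetical illustration: positions flagged False (standing in for 4D
    latent steps) contribute nothing to the loss, so no policy gradient
    would flow through them.
    """
    if len(log_probs) != len(is_text):
        raise ValueError("log_probs and is_text must have the same length")
    # Keep only the log-probabilities of positions marked as text tokens.
    text_log_probs = [lp for lp, keep in zip(log_probs, is_text) if keep]
    if not text_log_probs:
        return 0.0
    # Negative sign: minimizing the loss raises the probability of
    # text tokens in rollouts with a positive advantage.
    return -advantage * sum(text_log_probs) / len(text_log_probs)


# Toy rollout: two text tokens, two latent steps, then one text token.
log_probs = [-0.5, -1.0, -3.0, -2.0, -1.5]
is_text = [True, True, False, False, True]
loss = masked_policy_gradient_loss(log_probs, is_text, advantage=1.0)
```

In this toy rollout, changing the values at the two masked positions leaves the loss unchanged, which is the property the mask is meant to provide. A real implementation would apply the same mask to per-token tensors inside an autodiff framework.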