4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
December 18, 2025
Authors: Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
cs.AI
Abstract
Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
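The abstract only states that P4D transfers 4D representations from a frozen expert into 4D-RGPT; it does not specify the loss or the feature interface. The sketch below is a minimal, hypothetical illustration of such a feature-distillation step in PyTorch: the class name `P4DDistillationSketch`, the linear projection head, the cosine-alignment loss, and the tensor shapes are all assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class P4DDistillationSketch(nn.Module):
    """Illustrative feature-distillation loss: align a student MLLM's video-token
    features with 4D features from a frozen expert. Projection head, loss choice,
    and shapes are assumptions, not the paper's specification."""

    def __init__(self, student_dim: int, expert_dim: int):
        super().__init__()
        # Hypothetical linear projection mapping student features into the
        # frozen expert's 4D representation space.
        self.proj = nn.Linear(student_dim, expert_dim)

    def forward(self, student_feats: torch.Tensor, expert_feats: torch.Tensor) -> torch.Tensor:
        # student_feats: (B, T, N, student_dim) video-token features from the MLLM
        # expert_feats:  (B, T, N, expert_dim) features from the frozen 4D expert
        projected = self.proj(student_feats)
        # Cosine-based alignment, one of several plausible distillation objectives.
        return 1.0 - F.cosine_similarity(projected, expert_feats, dim=-1).mean()


if __name__ == "__main__":
    # Random tensors stand in for real features; dimensions are arbitrary.
    loss_fn = P4DDistillationSketch(student_dim=1024, expert_dim=768)
    student = torch.randn(2, 8, 196, 1024)  # batch, frames, tokens, dim
    expert = torch.randn(2, 8, 196, 768)    # frozen-expert 4D features
    print(loss_fn(student, expert).item())
```

In this kind of setup the expert's parameters stay frozen and only the student (and the projection head) receive gradients; how 4D-RGPT actually realizes this is described in the full paper, not the abstract.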