4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

December 18, 2025
Authors: Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
cs.AI

Abstract

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
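
The abstract describes P4D only at a high level: representations from a frozen expert are transferred into 4D-RGPT during training. As a rough illustration of that general idea, the sketch below shows generic feature distillation in plain Python. It is not the authors' implementation. The linear projection, the mean-squared-error objective, the weighting factor `lam`, and all function names are assumptions introduced here for illustration; the abstract does not specify any of them.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(features: Vector, weights: Matrix) -> Vector:
    """Map a student feature vector into the teacher's feature space.

    `weights` has one row per teacher dimension (hypothetical linear head).
    """
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def distillation_loss(student_feats: List[Vector],
                      teacher_feats: List[Vector],
                      weights: Matrix) -> float:
    """Mean squared error between projected student features and
    teacher features. The teacher is treated as frozen: its features
    are plain constants and nothing here updates them."""
    total = 0.0
    count = 0
    for s, t in zip(student_feats, teacher_feats):
        p = project(s, weights)
        total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
        count += len(t)
    return total / count


def total_loss(task_loss: float, distill_loss: float, lam: float = 1.0) -> float:
    """Combine the main task loss (e.g. VQA) with the distillation term.
    `lam` is an assumed weighting hyperparameter."""
    return task_loss + lam * distill_loss


# Toy usage: one token with 2-d student features, 3-d teacher features.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[1.0, 2.0]]
teacher = [[1.0, 2.0, 3.0]]
loss = total_loss(task_loss=0.5,
                  distill_loss=distillation_loss(student, teacher, weights),
                  lam=2.0)
```

In a real training setup the student features and projection would be differentiable tensors, while the expert's outputs would be computed without gradients so that only the student is updated.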