

AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes

October 12, 2025
Authors: Yu Li, Menghan Xia, Gongye Liu, Jianhong Bai, Xintao Wang, Conglang Zhang, Yuxuan Lin, Ruihang Chu, Pengfei Wan, Yujiu Yang
cs.AI

Abstract

Recent Text-to-Video (T2V) models have demonstrated a powerful capability for visually simulating real-world geometry and physical laws, indicating their potential as implicit world models. Inspired by this, we explore the feasibility of leveraging the video generation prior for viewpoint planning in given 4D scenes, since videos inherently pair dynamic scenes with natural viewpoints. To this end, we propose a two-stage paradigm that adapts pre-trained T2V models for viewpoint prediction in a compatible manner. First, we inject the 4D scene representation into the pre-trained T2V model via an adaptive learning branch, where the 4D scene is viewpoint-agnostic and the conditionally generated video embeds the viewpoints visually. Then, we formulate viewpoint extraction as a hybrid-condition-guided camera-extrinsic denoising process. Specifically, a camera-extrinsic diffusion branch is introduced on top of the pre-trained T2V model, taking the generated video and the 4D scene as input. Experimental results show that our method outperforms existing competitors, and ablation studies validate the effectiveness of our key technical designs. To some extent, this work demonstrates the potential of video generation models for 4D interaction in the real world.
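The abstract gives no implementation details, so the following PyTorch sketch is only an illustration of the two-stage idea under stated assumptions: a zero-initialized adapter that adds projected 4D-scene features to the token stream of a frozen T2V backbone (stage one), and a small diffusion head that denoises per-frame camera extrinsics, here flattened 3x4 [R|t] matrices, conditioned jointly on video features and scene features (stage two). All module names, dimensions, the epsilon-prediction objective, and the DDPM-style sampler are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class SceneInjectionBranch(nn.Module):
    """Stage 1 (sketch): inject a viewpoint-agnostic 4D-scene representation
    into a pre-trained T2V backbone via a lightweight adaptive branch.
    The projection is zero-initialized so training starts from the
    unmodified backbone behavior (a common adapter design choice; assumed)."""

    def __init__(self, scene_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(scene_dim, hidden_dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, video_tokens, scene_tokens):
        # video_tokens: (B, N, H) backbone features; scene_tokens: (B, N, S)
        return video_tokens + self.proj(scene_tokens)


class ExtrinsicDenoiser(nn.Module):
    """Stage 2 (sketch): hybrid-condition denoiser over per-frame camera
    extrinsics (12-D flattened [R|t]), conditioned on video and scene
    features plus a scalar timestep. Predicts the added noise (epsilon)."""

    def __init__(self, hidden_dim: int, pose_dim: int = 12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + 2 * hidden_dim + 1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, pose_dim),
        )

    def forward(self, noisy_pose, video_feat, scene_feat, t):
        # noisy_pose: (B, F, 12); video_feat, scene_feat: (B, F, H); t: (B,)
        t_emb = t.view(-1, 1, 1).expand(-1, noisy_pose.shape[1], 1)
        x = torch.cat([noisy_pose, video_feat, scene_feat, t_emb], dim=-1)
        return self.net(x)


@torch.no_grad()
def sample_extrinsics(denoiser, video_feat, scene_feat, steps: int = 50):
    """Minimal DDPM-style ancestral sampling over flattened poses (assumed
    sampler; the paper may use a different schedule or parameterization)."""
    B, F, _ = video_feat.shape
    pose = torch.randn(B, F, 12)  # start from pure Gaussian noise
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for i in reversed(range(steps)):
        t = torch.full((B,), i / steps)
        eps = denoiser(pose, video_feat, scene_feat, t)
        a, ab = alphas[i], alpha_bars[i]
        pose = (pose - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if i > 0:  # re-inject noise except at the final step
            pose = pose + torch.sqrt(betas[i]) * torch.randn_like(pose)
    return pose  # (B, F, 12): per-frame camera extrinsics


if __name__ == "__main__":
    B, F, H, S = 1, 16, 64, 32
    inject = SceneInjectionBranch(S, H)
    denoiser = ExtrinsicDenoiser(H)
    video_feat = torch.randn(B, F, H)  # stand-in for T2V video features
    scene_feat = torch.randn(B, F, H)  # stand-in for encoded 4D scene
    poses = sample_extrinsics(denoiser, video_feat, scene_feat)
    print(poses.shape)  # torch.Size([1, 16, 12])
```

The key property this sketch tries to capture is compatibility: the injection branch leaves the pre-trained T2V weights untouched at initialization, and the extrinsic diffusion head sits alongside the backbone rather than modifying it, mirroring the abstract's claim that both stages adapt the T2V model "in a compatible manner."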