Controlling Space and Time with Diffusion Models
July 10, 2024
Authors: Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, David J. Fleet
cs.AI
Abstract
We present 4DiM, a cascaded diffusion model for 4D novel view synthesis
(NVS), conditioned on one or more images of a general scene, and a set of
camera poses and timestamps. To overcome challenges due to limited availability
of 4D training data, we advocate joint training on 3D (with camera pose), 4D
(pose+time) and video (time but no pose) data and propose a new architecture
that enables the same. We further advocate the calibration of SfM posed data
using monocular metric depth estimators for metric scale camera control. For
model evaluation, we introduce new metrics to enrich and overcome shortcomings
of current evaluation schemes, demonstrating state-of-the-art results in both
fidelity and pose control compared to existing diffusion models for 3D NVS,
while at the same time adding the ability to handle temporal dynamics. 4DiM is
also used for improved panorama stitching, pose-conditioned video to video
translation, and several other tasks. For an overview see
https://4d-diffusion.github.io
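
The abstract mentions calibrating SfM poses with a monocular metric depth estimator but does not spell out the procedure. The sketch below is one plausible reading and not the authors' implementation: it assumes the scale factor is the median ratio between estimated metric depths and SfM depths at matching pixels, and that only the camera translation needs rescaling. The function names `metric_scale_factor` and `rescale_pose` are hypothetical.

```python
from statistics import median


def metric_scale_factor(sfm_depths, metric_depths):
    """Estimate the metres-per-SfM-unit scale (assumed: median depth ratio).

    sfm_depths: depths of sparse points in arbitrary SfM units.
    metric_depths: depths at the same pixels from a metric depth estimator.
    Pairs with a non-positive depth on either side are ignored.
    """
    ratios = [m / s for s, m in zip(sfm_depths, metric_depths) if s > 0 and m > 0]
    if not ratios:
        raise ValueError("no valid depth pairs to estimate a scale from")
    return median(ratios)


def rescale_pose(cam_to_world, scale):
    """Return a copy of a 4x4 camera-to-world matrix with scaled translation.

    Rotation is scale-free, so only the last column of the top three rows
    (the camera position) is multiplied by the scale factor.
    """
    scaled = [list(row) for row in cam_to_world]
    for i in range(3):
        scaled[i][3] *= scale
    return scaled


# Usage with made-up numbers: one invalid SfM depth (0.0) is skipped.
sfm = [1.0, 2.0, 0.0, 4.0]
metric = [2.5, 5.0, 1.0, 10.0]
scale = metric_scale_factor(sfm, metric)
pose = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]
metric_pose = rescale_pose(pose, scale)
```

The median is used here because it tolerates outlier depth estimates better than a mean; a real pipeline would likely aggregate over many frames of a scene before fixing a single scale.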