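Controlling Space and Time with Diffusion Models

Abstract

We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), conditioned on one or more images of a general scene together with a set of camera poses and timestamps. To overcome the limited availability of 4D training data, we advocate joint training on 3D (with camera pose), 4D (pose+time) and video (time but no pose) data, and propose a new architecture that enables this. We further advocate calibrating SfM-posed data with monocular metric depth estimators to obtain metric-scale camera control. For model evaluation, we introduce new metrics that enrich current evaluation schemes and overcome their shortcomings, demonstrating state-of-the-art results in both fidelity and pose control compared to existing diffusion models for 3D NVS, while also adding the ability to handle temporal dynamics. 4DiM is also used for improved panorama stitching, pose-conditioned video-to-video translation, and several other tasks. For an overview, see https://4d-diffusion.github.io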
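To make the metric-scale calibration idea concrete, the following is a minimal sketch of one common way to align up-to-scale SfM geometry with a metric depth prediction. It is an illustration under stated assumptions, not the procedure used for 4DiM: the abstract does not specify the estimator, and the median-ratio rule and the names `estimate_metric_scale` and `rescale_poses` are hypothetical.

```python
import numpy as np


def estimate_metric_scale(sfm_depth, metric_depth, valid=None):
    """Estimate a single scale factor s such that s * sfm_depth ~ metric_depth.

    Assumption: a robust median of per-pixel depth ratios is used; the
    source does not state which estimator the authors apply.
    """
    sfm_depth = np.asarray(sfm_depth, dtype=np.float64)
    metric_depth = np.asarray(metric_depth, dtype=np.float64)
    # Keep only pixels where both depths are finite and positive.
    mask = (
        np.isfinite(sfm_depth)
        & np.isfinite(metric_depth)
        & (sfm_depth > 0)
        & (metric_depth > 0)
    )
    if valid is not None:
        mask &= np.asarray(valid, dtype=bool)
    if not mask.any():
        raise ValueError("no valid depth pairs to estimate scale from")
    return float(np.median(metric_depth[mask] / sfm_depth[mask]))


def rescale_poses(poses, scale):
    """Scale the translation part of 4x4 camera pose matrices.

    Rotations are unchanged; only the last column's first three
    entries (the translation) are multiplied by `scale`.
    """
    poses = np.array(poses, dtype=np.float64, copy=True)
    poses[..., :3, 3] *= scale
    return poses
```

With a scale estimated per scene, multiplying every camera translation by it expresses the SfM trajectory in the metric units of the depth estimator, which is what makes a pose such as "move 1 m forward" meaningful as a conditioning signal.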

