Controlling Space and Time with Diffusion Models
July 10, 2024
Authors: Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, David J. Fleet
cs.AI
Abstract
We present 4DiM, a cascaded diffusion model for 4D novel view synthesis
(NVS), conditioned on one or more images of a general scene, and a set of
camera poses and timestamps. To overcome challenges due to limited availability
of 4D training data, we advocate joint training on 3D (with camera pose), 4D
(pose+time) and video (time but no pose) data and propose a new architecture
that enables the same. We further advocate the calibration of SfM posed data
using monocular metric depth estimators for metric scale camera control. For
model evaluation, we introduce new metrics to enrich and overcome shortcomings
of current evaluation schemes, demonstrating state-of-the-art results in both
fidelity and pose control compared to existing diffusion models for 3D NVS,
while at the same time adding the ability to handle temporal dynamics. 4DiM is
also used for improved panorama stitching, pose-conditioned video-to-video
translation, and several other tasks. For an overview see
https://4d-diffusion.github.io
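The abstract does not say how one network consumes mixed 3D, 4D, and video-only training batches, so the following is only a hypothetical sketch, not the authors' implementation: one simple way to make pose and time interchangeable conditioning signals is to embed each one, zero it out when a sample lacks it, and append presence flags so the denoiser knows which controls are active. All function names, shapes, and the random-projection embeddings below are illustrative assumptions.

```python
import numpy as np

def make_conditioning(pose=None, timestamp=None, dim=64, seed=0):
    """Embed per-frame conditioning signals, masking whatever is missing.

    Hypothetical sketch (not from the paper): `pose` is a 3x4 camera-to-world
    matrix, `timestamp` a scalar in [0, 1]. Missing signals are zeroed and
    binary presence flags are appended so the model can distinguish
    "no pose given" from "pose is the identity".
    """
    rng = np.random.default_rng(seed)
    # Fixed random projections stand in for learned embedding layers.
    pose_proj = rng.normal(size=(12, dim)) / np.sqrt(12)
    time_proj = rng.normal(size=(1, dim))

    pose_emb = pose.reshape(-1) @ pose_proj if pose is not None else np.zeros(dim)
    time_emb = np.array([timestamp]) @ time_proj if timestamp is not None else np.zeros(dim)
    flags = np.array([pose is not None, timestamp is not None], dtype=float)
    return np.concatenate([pose_emb, time_emb, flags])

# 4D sample: pose and time are both available.
cond_4d = make_conditioning(pose=np.eye(3, 4), timestamp=0.5)
# Video-only sample: time but no pose, so the pose slot is masked.
cond_video = make_conditioning(timestamp=0.5)
# 3D sample: pose but no timestamp.
cond_3d = make_conditioning(pose=np.eye(3, 4))
print(cond_4d.shape, cond_video[-2:], cond_3d[-2:])
```

Under this kind of scheme, video-only clips would supervise temporal dynamics while posed stills supervise camera control, which is consistent with the joint-training motivation stated in the abstract.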