The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation
June 2, 2023
Authors: Saurabh Saxena, Charles Herrmann, Junhwa Hur, Abhishek Kar, Mohammad Norouzi, Deqing Sun, David J. Fleet
cs.AI
Abstract
Denoising diffusion probabilistic models have transformed image generation
with their impressive fidelity and diversity. We show that they also excel in
estimating optical flow and monocular depth, surprisingly, without
task-specific architectures and loss functions that are predominant for these
tasks. Compared to the point estimates of conventional regression-based
methods, diffusion models also enable Monte Carlo inference, e.g., capturing
uncertainty and ambiguity in flow and depth. With self-supervised pre-training,
the combined use of synthetic and real data for supervised training, and
technical innovations (infilling and step-unrolled denoising diffusion
training) to handle noisy, incomplete training data, and a simple form of
coarse-to-fine refinement, one can train state-of-the-art diffusion models for
depth and optical flow estimation. Extensive experiments focus on quantitative
performance against benchmarks, ablations, and the model's ability to capture
uncertainty and multimodality, and impute missing values. Our model, DDVM
(Denoising Diffusion Vision Model), obtains a state-of-the-art relative depth
error of 0.074 on the indoor NYU benchmark and an Fl-all outlier rate of 3.26%
on the KITTI optical flow benchmark, about 25% better than the best published
method. For an overview see https://diffusion-vision.github.io.
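To make the core idea concrete, here is a minimal sketch of conditional ancestral sampling for depth: starting from pure Gaussian noise, a denoiser conditioned on the RGB image is applied repeatedly until a depth map emerges. Everything here is hypothetical illustration, not the authors' implementation: `toy_denoiser` is a hand-written stand-in for the trained network (DDVM uses a learned U-Net-style model), and the schedule and function names are assumptions.

```python
import numpy as np

def toy_denoiser(noisy_depth, rgb, t):
    """Placeholder for a learned noise-prediction network.
    It crudely treats the mean RGB intensity as a proxy depth signal;
    a real model would be trained to predict the injected noise."""
    proxy = rgb.mean(axis=-1)          # hypothetical proxy target
    return noisy_depth - proxy         # "predicted noise"

def ddpm_sample_depth(rgb, steps=50, seed=0):
    """Minimal DDPM ancestral-sampling loop: initialize the depth map
    as Gaussian noise, then denoise step by step, conditioning on the
    RGB image at every step."""
    rng = np.random.default_rng(seed)
    h, w, _ = rgb.shape
    betas = np.linspace(1e-4, 0.02, steps)      # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal((h, w))             # start from pure noise
    for t in reversed(range(steps)):
        eps = toy_denoiser(x, rgb, t)           # predict the noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal((h, w)) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise    # ancestral step
    return x
```

Because sampling is stochastic, running the loop with several seeds yields multiple plausible depth maps; their per-pixel variance is the kind of Monte Carlo uncertainty estimate the abstract refers to, and is unavailable from a single regression-based point estimate.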