BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation
July 25, 2024
Authors: Xiang Zhang, Bingxin Ke, Hayko Riemenschneider, Nando Metzger, Anton Obukhov, Markus Gross, Konrad Schindler, Christopher Schroers
cs.AI
Abstract
By training over large-scale datasets, zero-shot monocular depth estimation (MDE) methods show robust performance in the wild but often suffer from insufficiently precise details. Although recent diffusion-based MDE approaches exhibit appealing detail extraction ability, they still struggle in geometrically challenging scenes because it is difficult to learn robust geometric priors from diverse datasets. To leverage the complementary merits of both worlds, we propose BetterDepth, which efficiently achieves geometrically correct, affine-invariant MDE while capturing fine-grained details. Specifically, BetterDepth is a conditional diffusion-based refiner that takes the prediction of a pre-trained MDE model as depth conditioning, in which the global depth context is well captured, and iteratively refines details based on the input image. To train such a refiner, we propose global pre-alignment and local patch masking, which keep BetterDepth faithful to the depth conditioning while it learns to capture fine-grained scene details. Through efficient training on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot MDE performance on diverse public datasets and in-the-wild scenes. Moreover, BetterDepth improves the performance of other MDE models in a plug-and-play manner without additional re-training.
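The abstract does not spell out how global pre-alignment is computed, but aligning an affine-invariant depth prediction to a reference is commonly done with a closed-form least-squares scale-and-shift fit. The sketch below shows what such a pre-alignment step could look like in PyTorch; the function name `align_scale_shift` and the toy tensors are illustrative assumptions, not the authors' released code.

```python
import torch

def align_scale_shift(pred: torch.Tensor, target: torch.Tensor,
                      valid: torch.Tensor) -> torch.Tensor:
    """Align an affine-invariant depth prediction to a reference depth map.

    Solves min_{s,t} || s * pred + t - target ||^2 over the `valid` pixels
    in closed form (2x2 normal equations), then applies s and t globally.
    """
    p = pred[valid]
    g = target[valid]
    # Normal equations A^T A [s, t]^T = A^T g with design matrix A = [p, 1].
    a00 = (p * p).sum()
    a01 = p.sum()
    a11 = torch.tensor(float(p.numel()), dtype=p.dtype, device=p.device)
    b0 = (p * g).sum()
    b1 = g.sum()
    det = a00 * a11 - a01 * a01
    s = (a11 * b0 - a01 * b1) / det
    t = (a00 * b1 - a01 * b0) / det
    return s * pred + t

# Toy usage: recover the scale and shift of a synthetic ground truth.
pred = torch.rand(1, 480, 640)
gt = 2.0 * pred + 0.5 + 0.01 * torch.randn_like(pred)
aligned = align_scale_shift(pred, gt, torch.ones_like(pred, dtype=torch.bool))
```

In this reading, the pre-aligned coarse prediction serves as the depth condition during training, so the refiner stays faithful to the global layout of whatever pre-trained MDE backbone is attached at inference time and only adds detail, which is what makes the plug-and-play use without re-training plausible.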