

Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting

April 30, 2024
作者: Paul Engstler, Andrea Vedaldi, Iro Laina, Christian Rupprecht
cs.AI

Abstract

3D scene generation has quickly become a challenging new research direction, fueled by consistent improvements of 2D generative diffusion models. Most prior work in this area generates scenes by iteratively stitching newly generated frames with existing geometry. These works often depend on pre-trained monocular depth estimators to lift the generated images into 3D, fusing them with the existing scene representation. These approaches are then often evaluated via a text metric, measuring the similarity between the generated images and a given text prompt. In this work, we make two fundamental contributions to the field of 3D scene generation. First, we note that lifting images to 3D with a monocular depth estimation model is suboptimal as it ignores the geometry of the existing scene. We thus introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process, resulting in improved geometric coherence of the scene. Second, we introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry, and thus measures the quality of the structure of the scene.
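The abstract's first contribution hinges on how generated frames are "lifted" into 3D: a monocular depth map is unprojected into a point cloud using the camera intrinsics, and the paper argues this step should be conditioned on the existing scene geometry rather than done per-frame. The following is a minimal sketch of the plain unprojection step only, assuming a pinhole camera model with hypothetical intrinsics `fx, fy, cx, cy`; it is not the paper's depth completion model, which additionally conditions on known scene depth.

```python
import numpy as np

def lift_depth_to_3d(depth, fx, fy, cx, cy):
    """Unproject a depth map of shape (H, W) into camera-space 3D points
    of shape (H, W, 3) under a pinhole model. Illustrative sketch only:
    the paper's pipeline replaces naive per-frame lifting with a depth
    completion model that respects already-fused scene geometry."""
    h, w = depth.shape
    # Pixel coordinate grids: u runs over columns, v over rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Toy example (hypothetical intrinsics): a flat surface 2 m from the camera.
depth = np.full((4, 4), 2.0)
points = lift_depth_to_3d(depth, fx=1.0, fy=1.0, cx=2.0, cy=2.0)
```

Because each frame is unprojected independently, nothing in this step constrains the new points to agree with geometry generated earlier, which is precisely the inconsistency the proposed depth completion model is trained (via teacher distillation and self-training) to remove.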

