

Extend3D: Town-Scale 3D Generation

March 31, 2026
Authors: Seungwoo Yoon, Jinmo Kim, Jaesik Park
cs.AI

Abstract

In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitation of the fixed-size latent space in object-centric models when representing wide scenes, we extend the latent space in the x and y directions. Then, by dividing the extended latent space into overlapping patches, we apply the object-centric 3D generative model to each patch and couple the patches at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene using a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We find that treating the incompleteness of the 3D structure as noise during 3D refinement enables 3D completion via a mechanism we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising, ensuring that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives for improved geometric structure and texture fidelity. We demonstrate that our method yields better results than prior methods, as evidenced by human preference studies and quantitative experiments.
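The two core mechanisms the abstract describes — coupling overlapping latent patches at each denoising step, and SDEdit-style partial noising of an incomplete 3D initialization ("under-noising") — can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: `denoise_patch` is a hypothetical stand-in for the object-centric model's per-patch denoising step, and the overlap-averaging coupling rule, patch size, and stride are assumptions (averaging is one common way such patch-wise predictions are coupled).

```python
import numpy as np

def denoise_patch(patch, t):
    # Hypothetical stand-in for one denoising step of the object-centric
    # 3D generative model on a single fixed-size latent patch.
    return patch * (1.0 - 0.1 * t)

def coupled_denoise_step(latent, t, patch_size=4, stride=2):
    """One denoising step over the extended latent: denoise each
    overlapping patch independently, then average the per-cell
    predictions so overlapping regions stay mutually consistent."""
    acc = np.zeros_like(latent)
    count = np.zeros_like(latent)
    H, W = latent.shape[:2]
    for y in range(0, H - patch_size + 1, stride):
        for x in range(0, W - patch_size + 1, stride):
            patch = latent[y:y + patch_size, x:x + patch_size]
            acc[y:y + patch_size, x:x + patch_size] += denoise_patch(patch, t)
            count[y:y + patch_size, x:x + patch_size] += 1
    return acc / np.maximum(count, 1)

def under_noise(init_latent, sigma, strength=0.5, rng=None):
    """SDEdit-style partial noising: perturb the point-cloud-initialized
    latent to an intermediate noise level, so that denoising completes
    the missing structure instead of regenerating the scene from scratch."""
    rng = rng or np.random.default_rng(0)
    return init_latent + strength * sigma * rng.standard_normal(init_latent.shape)
```

In this sketch, a lower `strength` keeps more of the depth-derived initialization and only repairs its gaps, which mirrors the idea of treating structural incompleteness as noise to be removed.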