Extend3D: Town-Scale 3D Generation
March 31, 2026
Authors: Seungwoo Yoon, Jinmo Kim, Jaesik Park
cs.AI
Abstract
In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the fixed-size latent space of object-centric models, which limits their ability to represent wide scenes, we extend the latent space along the x and y axes. We then divide the extended latent space into overlapping patches, apply the object-centric 3D generative model to each patch, and couple the patches at every denoising time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene with a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We find that treating the incompleteness of the 3D structure as noise during 3D refinement enables 3D completion through what we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising so that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives that improve geometric structure and texture fidelity. We demonstrate that our method outperforms prior methods, as evidenced by human preference studies and quantitative experiments.
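To make the patch-wise generation with per-step coupling concrete, the snippet below gives a minimal NumPy sketch; it is illustrative, not the authors' implementation. Here `denoise_step` is a hypothetical stand-in for one reverse-diffusion step of the fixed-size object-centric model (image conditioning omitted for brevity), and overlapping patch predictions are coupled by averaging, a MultiDiffusion-style assumption.

```python
import numpy as np

def _starts(dim, patch, stride):
    # Patch start indices covering [0, dim); the last patch snaps to the edge.
    last = max(dim - patch, 0)
    s = list(range(0, last + 1, stride))
    if s[-1] != last:
        s.append(last)
    return s

def coupled_patch_denoise(latent, patch_size, stride, timesteps, denoise_step):
    """Sketch of patch-wise denoising over an extended latent.

    `latent` is an (X, Y, Z, C) array extended along x and y.
    `denoise_step(patch, t)` is a hypothetical call that runs one reverse
    step of the object-centric model on a fixed-size latent patch.
    Overlapping patches are denoised independently at each timestep and
    coupled by averaging their predictions where they overlap.
    """
    X, Y = latent.shape[:2]
    xs = _starts(X, patch_size, stride)
    ys = _starts(Y, patch_size, stride)

    for t in timesteps:
        acc = np.zeros_like(latent)
        weight = np.zeros(latent.shape[:2] + (1, 1))
        for x0 in xs:
            for y0 in ys:
                patch = latent[x0:x0 + patch_size, y0:y0 + patch_size]
                denoised = denoise_step(patch, t)  # fixed-size model call
                acc[x0:x0 + patch_size, y0:y0 + patch_size] += denoised
                weight[x0:x0 + patch_size, y0:y0 + patch_size] += 1.0
        latent = acc / weight  # couple overlapping predictions by averaging
    return latent
```

Averaging the overlapping predictions at every time step keeps neighboring patches consistent as they are denoised; the paper's actual coupling, latent optimization objectives, and SDEdit-based under-noising refinement may differ from this sketch.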