

OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

March 17, 2026
Authors: Sensen Gao, Zhaoqing Wang, Qihang Cao, Dongdong Yu, Changhu Wang, Tongliang Liu, Mingming Gong, Jiawang Bian
cs.AI

Abstract

Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE): it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance information and distilling semantic features into a unified 3D latent space. Furthermore, we introduce a token-level Cross-View Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF), which mitigates train-inference exposure bias and shapes a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.
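The abstract gives no equations for Manifold-Drift Forcing, but the core idea — blending the model's own drifted representations back into the clean training-time ones so that training conditions resemble inference — can be sketched roughly. The following is a minimal illustration, not the paper's implementation; the per-token Bernoulli gating and the blend probability `alpha` are assumptions for the sketch.

```python
import numpy as np

def mdf_mix(original, drifted, alpha=0.7, rng=None):
    """Hypothetical sketch of MDF-style representation mixing.

    With probability (1 - alpha), each token's clean latent is replaced
    by its drifted counterpart, exposing the model during training to
    the kind of off-manifold inputs it will produce at inference time.
    `alpha` and the token-level gating are illustrative assumptions.
    """
    rng = rng or np.random.default_rng(0)
    # One Bernoulli draw per token (first axis); broadcast over feature dim.
    mask = rng.random(original.shape[0]) < (1.0 - alpha)
    return np.where(mask[:, None], drifted, original)

# Toy latents: 8 tokens of dimension 4.
orig = np.zeros((8, 4))    # stand-in for clean training representations
drift = np.ones((8, 4))    # stand-in for drifted (inference-time) ones
mixed = mdf_mix(orig, drift, alpha=0.7)
```

Each row of `mixed` comes wholly from one source or the other, so the diffusion model sees a mixture of clean and drifted tokens within a single training example.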