OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

Abstract

Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE), which builds on pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce a token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and we propose Manifold-Drift Forcing (MDF), which mitigates train-inference exposure bias and shapes a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.
