OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

Abstract

Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE), which leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce a token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF), which mitigates train-inference exposure bias and shapes a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.
