SPATIALGEN: Layout-guided 3D Indoor Scene Generation
September 18, 2025
Authors: Chuan Fang, Heng Li, Yixun Liang, Jia Zheng, Yongsen Mao, Yuan Liu, Rui Tang, Zihan Zhou, Ping Tan
cs.AI
Abstract
Creating high-fidelity 3D models of indoor environments is essential for
applications in design, virtual reality, and robotics. However, manual 3D
modeling remains time-consuming and labor-intensive. While recent advances in
generative AI have enabled automated scene synthesis, existing methods often
face challenges in balancing visual quality, diversity, semantic consistency,
and user control. A major bottleneck is the lack of a large-scale, high-quality
dataset tailored to this task. To address this gap, we introduce a
comprehensive synthetic dataset featuring 12,328 scenes with structured
annotations, 57,440 rooms, and 4.7M photorealistic 2D renderings. Leveraging this
dataset, we present SpatialGen, a novel multi-view multi-modal diffusion model
that generates realistic and semantically consistent 3D indoor scenes. Given a
3D layout and a reference image (derived from a text prompt), our model
synthesizes appearance (color images), geometry (scene coordinate maps), and
semantics (semantic segmentation maps) from arbitrary viewpoints while
preserving spatial consistency across modalities. In our experiments,
SpatialGen consistently outperforms previous methods. We are
open-sourcing our data and models to empower the community and advance the
field of indoor scene understanding and generation.
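As a concrete illustration of the pipeline the abstract describes, below is a minimal Python sketch of the model's input/output interface. Every name here (SceneInputs, ViewOutputs, generate_views) and the layout-box encoding are assumptions made for illustration, not the released SpatialGen API; only the modalities and their pairing with viewpoints come from the abstract.

# Hypothetical interface sketch for a layout-conditioned multi-view,
# multi-modal generator. Names and encodings are assumptions, not the
# released SpatialGen API.
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneInputs:
    layout_boxes: np.ndarray      # (N, 7) semantic 3D boxes: x, y, z, w, h, d, yaw (assumed encoding)
    reference_image: np.ndarray   # (H, W, 3) reference image derived from the text prompt
    camera_poses: np.ndarray      # (V, 4, 4) arbitrary target viewpoints as camera-to-world matrices

@dataclass
class ViewOutputs:
    color: np.ndarray             # (V, H, W, 3) appearance: per-view color images
    scene_coords: np.ndarray      # (V, H, W, 3) geometry: per-pixel 3D scene coordinates
    semantics: np.ndarray         # (V, H, W) per-pixel semantic segmentation labels

def generate_views(inputs: SceneInputs, h: int = 256, w: int = 256) -> ViewOutputs:
    """Placeholder for the multi-view multi-modal diffusion sampler.

    A real sampler would run layout-conditioned diffusion; here we only
    allocate output buffers with the shapes the abstract implies, to make
    the input/output contract explicit.
    """
    v = inputs.camera_poses.shape[0]
    return ViewOutputs(
        color=np.zeros((v, h, w, 3), dtype=np.float32),
        scene_coords=np.zeros((v, h, w, 3), dtype=np.float32),
        semantics=np.zeros((v, h, w), dtype=np.int64),
    )

# Example: request 4 views of a scene described by 10 layout boxes.
inputs = SceneInputs(
    layout_boxes=np.zeros((10, 7), dtype=np.float32),
    reference_image=np.zeros((256, 256, 3), dtype=np.float32),
    camera_poses=np.tile(np.eye(4, dtype=np.float32), (4, 1, 1)),
)
views = generate_views(inputs)
assert views.color.shape == (4, 256, 256, 3)

The point this sketch makes explicit is that all three modalities are produced jointly per viewpoint, so downstream 3D reconstruction can fuse color, geometry, and semantics that already agree with one another.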