SPATIALGEN: Layout-guided 3D Indoor Scene Generation
September 18, 2025
Authors: Chuan Fang, Heng Li, Yixun Liang, Jia Zheng, Yongsen Mao, Yuan Liu, Rui Tang, Zihan Zhou, Ping Tan
cs.AI
Abstract
Creating high-fidelity 3D models of indoor environments is essential for
applications in design, virtual reality, and robotics. However, manual 3D
modeling remains time-consuming and labor-intensive. While recent advances in
generative AI have enabled automated scene synthesis, existing methods often
face challenges in balancing visual quality, diversity, semantic consistency,
and user control. A major bottleneck is the lack of a large-scale, high-quality
dataset tailored to this task. To address this gap, we introduce a
comprehensive synthetic dataset featuring 12,328 structurally annotated scenes
with 57,440 rooms and 4.7M photorealistic 2D renderings. Leveraging this
dataset, we present SpatialGen, a novel multi-view multi-modal diffusion model
that generates realistic and semantically consistent 3D indoor scenes. Given a
3D layout and a reference image (derived from a text prompt), our model
synthesizes appearance (color image), geometry (scene coordinate map), and
semantics (semantic segmentation map) from arbitrary viewpoints, while
preserving spatial consistency across modalities. In our experiments,
SpatialGen consistently produces results superior to those of previous methods. We are
open-sourcing our data and models to empower the community and advance the
field of indoor scene understanding and generation.
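To make the input/output contract described above concrete, the sketch below outlines, in plain Python with NumPy, the kind of interface the abstract implies: given a 3D layout, a reference image, and target camera poses, the model returns a color image, a scene coordinate map, and a semantic segmentation map for each viewpoint. All names and shapes here (`ViewOutputs`, `generate_views`, etc.) are illustrative assumptions, not the authors' released API.

```python
# Hypothetical interface sketch for the multi-view, multi-modal generation
# described in the abstract. Names and shapes are assumptions for illustration;
# this is NOT the authors' released API.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class ViewOutputs:
    """Per-viewpoint outputs: appearance, geometry, and semantics."""
    color: np.ndarray          # (H, W, 3) RGB image
    scene_coords: np.ndarray   # (H, W, 3) per-pixel 3D scene coordinates
    semantics: np.ndarray      # (H, W) per-pixel semantic class indices


def generate_views(layout_boxes: List[dict],
                   reference_image: np.ndarray,
                   camera_poses: List[np.ndarray],
                   height: int = 256,
                   width: int = 256) -> List[ViewOutputs]:
    """Dummy stand-in for the diffusion model: one multi-modal output per pose.

    `layout_boxes` would carry the 3D room layout (e.g., object bounding boxes
    and wall structure), and `camera_poses` the 4x4 matrices of the target
    viewpoints. Here we only return placeholder arrays with the expected shapes.
    """
    outputs = []
    for _ in camera_poses:
        outputs.append(ViewOutputs(
            color=np.zeros((height, width, 3), dtype=np.uint8),
            scene_coords=np.zeros((height, width, 3), dtype=np.float32),
            semantics=np.zeros((height, width), dtype=np.int64),
        ))
    return outputs


if __name__ == "__main__":
    poses = [np.eye(4) for _ in range(4)]          # four arbitrary viewpoints
    ref = np.zeros((256, 256, 3), dtype=np.uint8)  # reference image placeholder
    views = generate_views(layout_boxes=[], reference_image=ref, camera_poses=poses)
    print(len(views), views[0].color.shape, views[0].semantics.shape)
```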