SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass
August 21, 2025
Authors: Yanxu Meng, Haoning Wu, Ya Zhang, Weidi Xie
cs.AI
Abstract
3D content generation has recently attracted significant research interest
due to its applications in VR/AR and embodied AI. In this work, we address the
challenging task of synthesizing multiple 3D assets from a single scene
image. Concretely, our contributions are fourfold: (i) we present SceneGen, a
novel framework that takes a scene image and corresponding object masks as
input, simultaneously producing multiple 3D assets with geometry and texture.
Notably, SceneGen operates with no need for optimization or asset retrieval;
(ii) we introduce a novel feature aggregation module that integrates local and
global scene information from visual and geometric encoders during feature
extraction. Coupled with a position head, this enables the generation of
3D assets and their relative spatial positions in a single feedforward pass;
(iii) we demonstrate SceneGen's direct extensibility to multi-image input
scenarios. Despite being trained solely on single-image inputs, our
architectural design enables improved generation performance with multi-image
inputs; and (iv) extensive quantitative and qualitative evaluations confirm the
efficiency and robust generation abilities of our approach. We believe this
paradigm offers a novel solution for high-quality 3D content generation,
potentially advancing its practical applications in downstream tasks. The code
and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
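The single-pass pipeline the abstract describes can be sketched at a high level: per-object local features and global scene features come from visual and geometric encoders, a feature aggregation step fuses them, and two heads then predict each asset's latent representation and its relative 3D position in one forward pass. The sketch below is only an illustration of that data flow, not the released model; every function, dimension, and weight matrix here is a hypothetical stand-in (random features in place of real encoders, linear layers in place of the actual heads).

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_visual(image, masks):
    # Stand-in for a visual encoder: one local feature per masked object,
    # plus a global scene feature. The 64-dim size is an arbitrary choice.
    local = rng.standard_normal((len(masks), 64))
    global_feat = rng.standard_normal(64)
    return local, global_feat

def encode_geometric(image):
    # Stand-in for a geometric encoder producing a global scene feature.
    return rng.standard_normal(64)

def aggregate(local, vis_global, geo_global):
    # Feature aggregation: fuse each object's local feature with global
    # visual and geometric context (simple concatenation in this sketch).
    n = local.shape[0]
    return np.concatenate(
        [local, np.tile(vis_global, (n, 1)), np.tile(geo_global, (n, 1))],
        axis=1,
    )

def position_head(fused, weights):
    # Linear head predicting a relative 3D position (x, y, z) per asset.
    return fused @ weights

def asset_head(fused, weights):
    # Linear head predicting a latent code per asset; in the real system
    # a generator would decode this into geometry and texture.
    return fused @ weights

# One feedforward pass over a dummy scene with 3 object masks.
image = np.zeros((256, 256, 3))
masks = [np.zeros((256, 256), dtype=bool) for _ in range(3)]

local, vis_g = encode_visual(image, masks)
geo_g = encode_geometric(image)
fused = aggregate(local, vis_g, geo_g)        # shape (3, 192)

W_pos = rng.standard_normal((192, 3))
W_asset = rng.standard_normal((192, 128))
positions = position_head(fused, W_pos)       # (3, 3): xyz per asset
latents = asset_head(fused, W_asset)          # (3, 128): one code per asset
print(positions.shape, latents.shape)
```

Because both heads read from the same fused features, assets and their spatial layout come out of a single pass, with no per-scene optimization loop or asset-retrieval step, which is the efficiency property the abstract emphasizes.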