SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Abstract

3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets from a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input and simultaneously produces multiple 3D assets with geometry and texture. Notably, SceneGen requires neither optimization nor asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
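To make the data flow in contribution (ii) concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a toy stand-in for the learned encoders: local information is approximated by masked average pooling over a feature map, global information by the mean over the whole map, and the "position head" by a single linear layer. All function names (`masked_mean`, `aggregate`, `position_head`, `forward`) and the shape of the pose vector are hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # H x W x C feature map (assumed input format)
Mask = List[List[int]]                # H x W binary object mask


def masked_mean(features: FeatureMap, mask: Mask) -> List[float]:
    """Average the C-dim feature vectors over pixels where mask == 1."""
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total  # empty mask: fall back to a zero vector
    return [t / count for t in total]


def aggregate(features: FeatureMap, masks: List[Mask]) -> List[List[float]]:
    """Build one token per object: local (masked) feature + global scene feature."""
    h, w = len(features), len(features[0])
    full_mask = [[1] * w for _ in range(h)]
    global_feat = masked_mean(features, full_mask)
    # List concatenation: each token has length 2 * C.
    return [masked_mean(features, m) + global_feat for m in masks]


def position_head(token: List[float], weights: List[List[float]],
                  bias: List[float]) -> List[float]:
    """Hypothetical linear head mapping a token to a pose vector."""
    return [sum(w * x for w, x in zip(row, token)) + b
            for row, b in zip(weights, bias)]


def forward(features: FeatureMap, masks: List[Mask],
            weights: List[List[float]], bias: List[float]) -> List[List[float]]:
    """Single pass: aggregate features, then predict one pose per masked object."""
    return [position_head(t, weights, bias) for t in aggregate(features, masks)]
```

The point of the sketch is only the structure: every object token carries both its own masked evidence and shared scene context, so relative positions for all objects can be predicted in one pass without per-object optimization. The actual system described in the abstract uses learned visual and geometric encoders rather than simple pooling.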
