SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Abstract

3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
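Contribution (ii) describes combining local (per-object) and global (whole-scene) information from a visual encoder and a geometric encoder. The abstract gives no implementation details, so the following is only a toy sketch of the general idea, not SceneGen's actual module. It assumes single-channel feature maps and uses masked average pooling as a stand-in for the learned aggregation. The names `global_mean`, `masked_mean`, and `aggregate` are hypothetical.

```python
from typing import List

Grid = List[List[float]]  # H x W toy feature map (stand-in for an encoder output)
Mask = List[List[int]]    # H x W binary mask, 1 marks the pixels of one object


def global_mean(features: Grid) -> float:
    """Global scene descriptor: average over every location of the map."""
    values = [v for row in features for v in row]
    return sum(values) / len(values)


def masked_mean(features: Grid, mask: Mask) -> float:
    """Local object descriptor: average over the locations selected by the mask."""
    selected = [
        v
        for f_row, m_row in zip(features, mask)
        for v, m in zip(f_row, m_row)
        if m
    ]
    if not selected:
        raise ValueError("mask selects no pixels")
    return sum(selected) / len(selected)


def aggregate(visual: Grid, geometric: Grid, masks: List[Mask]) -> List[List[float]]:
    """Return one vector per object:
    [local visual, local geometric, global visual, global geometric].

    Assumption: simple concatenation replaces whatever learned fusion
    the real feature aggregation module performs.
    """
    scene = [global_mean(visual), global_mean(geometric)]
    return [
        [masked_mean(visual, m), masked_mean(geometric, m)] + scene
        for m in masks
    ]


# Toy 2 x 2 scene with two objects.
visual = [[1.0, 2.0], [3.0, 4.0]]
geometric = [[0.0, 0.0], [2.0, 2.0]]
masks = [
    [[1, 0], [0, 0]],  # object 0: top-left pixel
    [[0, 0], [1, 1]],  # object 1: bottom row
]
per_object = aggregate(visual, geometric, masks)
```

Each object's vector carries its own local evidence plus the shared scene context. In a real system a downstream head (such as the position head mentioned in the abstract) would consume this kind of combined representation; that step is learned and is not modeled here.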
