LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

Abstract

3D content creation has achieved significant progress in terms of both quality and speed. Although current feed-forward models can produce 3D objects in seconds, their resolution is constrained by the intensive computation required during training. In this paper, we introduce Large Multi-View Gaussian Model (LGM), a novel framework designed to generate high-resolution 3D models from text prompts or single-view images. Our key insights are two-fold: 1) 3D Representation: We propose multi-view Gaussian features as an efficient yet powerful representation, which can then be fused together for differentiable rendering. 2) 3D Backbone: We present an asymmetric U-Net as a high-throughput backbone operating on multi-view images, which can be produced from text or single-view image input by leveraging multi-view diffusion models. Extensive experiments demonstrate the high fidelity and efficiency of our approach. Notably, we maintain the fast speed to generate 3D objects within 5 seconds while boosting the training resolution to 512, thereby achieving high-resolution 3D content generation.
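To make the idea of fusing multi-view Gaussian features more concrete, here is a minimal sketch in plain Python. It assumes that the backbone emits one feature vector per pixel per view and that each vector decodes into one 3D Gaussian. The 14-channel layout (position, opacity, scale, rotation, color), the `Gaussian` class, and the function names are hypothetical and chosen for illustration; the abstract does not specify them. Fusion is modeled here as simple concatenation of the per-view Gaussians into one set, which a differentiable renderer could then consume.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical channel layout (an assumption, not stated in the abstract):
# 3 position + 1 opacity + 3 scale + 4 rotation (quaternion) + 3 color.
NUM_CHANNELS = 14


@dataclass
class Gaussian:
    position: List[float]
    opacity: float
    scale: List[float]
    rotation: List[float]
    color: List[float]


def pixel_to_gaussian(feat: Sequence[float]) -> Gaussian:
    """Decode one per-pixel feature vector into one Gaussian."""
    if len(feat) != NUM_CHANNELS:
        raise ValueError(f"expected {NUM_CHANNELS} channels, got {len(feat)}")
    return Gaussian(
        position=list(feat[0:3]),
        opacity=feat[3],
        scale=list(feat[4:7]),
        rotation=list(feat[7:11]),
        color=list(feat[11:14]),
    )


def fuse_views(feature_maps) -> List[Gaussian]:
    """Concatenate the Gaussians of all views into a single flat list.

    feature_maps is indexed as [view][row][column] and each entry is a
    feature vector of length NUM_CHANNELS.
    """
    return [
        pixel_to_gaussian(pixel)
        for view in feature_maps
        for row in view
        for pixel in row
    ]


# Toy input: 4 views, each a 2x2 feature map filled with the view index.
views = [
    [[[float(v)] * NUM_CHANNELS for _ in range(2)] for _ in range(2)]
    for v in range(4)
]
gaussians = fuse_views(views)
print(len(gaussians))  # 4 views * 2 rows * 2 columns
```

In this sketch the number of Gaussians grows linearly with the number of views and with the feature-map resolution, which is one way to see why a high-throughput backbone matters when the training resolution is raised.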

