Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting

Abstract

Text-to-3D and image-to-3D generation have received considerable attention, but controllable text-to-3D generation, an important task that sits between them, remains under-explored. This work focuses on that task. First, we introduce Multi-view ControlNet (MVControl), a novel neural network architecture that enhances existing pre-trained multi-view diffusion models by integrating additional input conditions such as edge, depth, normal, and scribble maps. Our key innovation is a conditioning module that controls the base diffusion model using both local and global embeddings, computed from the input condition images and camera poses. Once trained, MVControl can provide 3D diffusion guidance for optimization-based 3D generation. Second, we propose an efficient multi-stage 3D generation pipeline that leverages recent large reconstruction models and score distillation algorithms. Building on the MVControl architecture, we employ a hybrid diffusion guidance method to direct the optimization process. For efficiency, we adopt 3D Gaussians as our representation instead of the commonly used implicit representations. We also pioneer the use of SuGaR, a hybrid representation that binds Gaussians to mesh triangle faces. This approach alleviates the poor geometry of 3D Gaussians and enables fine-grained geometry to be sculpted directly on the mesh. Extensive experiments demonstrate that our method generalizes robustly and enables controllable generation of high-quality 3D content.
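To make the idea of binding Gaussians to mesh triangle faces more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each Gaussian center is a fixed barycentric combination of its triangle's three vertices, so that moving a vertex moves the bound Gaussians with it. The function name `bound_gaussian_centers`, the constant `BARYCENTRIC_WEIGHTS`, and the choice of four Gaussians per face are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Face = Tuple[int, int, int]

# Hypothetical fixed barycentric weights (an assumption for this sketch):
# one Gaussian at the centroid and one pulled toward each corner.
# Each row sums to 1, so every center lies inside its triangle.
BARYCENTRIC_WEIGHTS: List[Vec3] = [
    (1 / 3, 1 / 3, 1 / 3),
    (2 / 3, 1 / 6, 1 / 6),
    (1 / 6, 2 / 3, 1 / 6),
    (1 / 6, 1 / 6, 2 / 3),
]


def bound_gaussian_centers(
    vertices: Sequence[Vec3], faces: Sequence[Face]
) -> List[Vec3]:
    """Return Gaussian centers derived from the mesh vertices.

    The centers are not free parameters here: each one is recomputed
    from the three vertices of the face it is bound to.
    """
    centers: List[Vec3] = []
    for i, j, k in faces:
        a, b, c = vertices[i], vertices[j], vertices[k]
        for wa, wb, wc in BARYCENTRIC_WEIGHTS:
            center = tuple(
                wa * a[n] + wb * b[n] + wc * c[n] for n in range(3)
            )
            centers.append(center)
    return centers


# A single triangle in the z = 0 plane.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2)]
centers = bound_gaussian_centers(vertices, faces)

# Lifting one vertex changes the geometry; the bound centers follow.
moved_vertices = [(0.0, 0.0, 1.0)] + vertices[1:]
moved_centers = bound_gaussian_centers(moved_vertices, faces)
```

Under this assumption, editing the mesh geometry is enough to update the Gaussians, which is one way to read the abstract's claim that geometry can be sculpted directly on the mesh. A full representation would also carry per-Gaussian rotation, scale, opacity, and color, which this sketch omits.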
