Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model

Abstract

Text-to-3D generation with diffusion models has achieved remarkable progress in recent years. However, existing methods either rely on score distillation-based optimization, which suffers from slow inference, low diversity, and Janus problems, or are feed-forward methods that generate low-quality results due to the scarcity of 3D training data. In this paper, we propose Instant3D, a novel method that generates high-quality and diverse 3D assets from text prompts in a feed-forward manner. We adopt a two-stage paradigm, which first generates a sparse set of four structured and consistent views from text in one shot with a fine-tuned 2D text-to-image diffusion model, and then directly regresses the NeRF from the generated images with a novel transformer-based sparse-view reconstructor. Through extensive experiments, we demonstrate that our method can generate high-quality, diverse, and Janus-free 3D assets within 20 seconds, which is two orders of magnitude faster than previous optimization-based methods that can take 1 to 10 hours. Our project webpage: https://jiahao.ai/instant3d/.
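
The speed comparison in the abstract can be checked with simple arithmetic. The sketch below is only an illustration, not part of the authors' code: it takes the two figures quoted above (20 seconds, and 1 to 10 hours) and computes the implied speedup. The function name `speedup` and the constants are hypothetical names introduced here, and because 20 seconds is stated as an upper bound, the ratios should be read as lower bounds.

```python
import math


def speedup(baseline_seconds: float, method_seconds: float) -> float:
    """Return how many times faster the method is than the baseline."""
    return baseline_seconds / method_seconds


# Figures quoted in the abstract (names are illustrative assumptions).
INSTANT3D_SECONDS = 20      # "within 20 seconds", treated as an upper bound
BASELINE_HOURS = (1, 10)    # "1 to 10 hours" for optimization-based methods

for hours in BASELINE_HOURS:
    ratio = speedup(hours * 3600, INSTANT3D_SECONDS)
    magnitude = math.log10(ratio)
    print(f"{hours} h baseline: {ratio:.0f}x faster (about 10^{magnitude:.1f})")
```

Under these assumptions the ratio ranges from 180x for a 1-hour baseline to 1800x for a 10-hour baseline, which is consistent with the "two orders of magnitude" wording.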
