HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation

Abstract

Despite recent remarkable advances in generative modeling, efficiently generating high-quality 3D assets from textual prompts remains a difficult task. A key challenge is data scarcity: the most extensive 3D datasets encompass merely millions of assets, while their 2D counterparts contain billions of text-image pairs. To address this, we propose a novel approach that harnesses the power of large, pretrained 2D diffusion models. More specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image model to jointly predict six orthographic projections and the corresponding latent triplane. We then decode these latents to generate a textured mesh. HexaGen3D does not require per-sample optimization and can infer high-quality, diverse objects from textual prompts in 7 seconds, offering significantly better quality-to-latency trade-offs than existing approaches. Furthermore, HexaGen3D demonstrates strong generalization to new objects or compositions.
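
To make the term "latent triplane" more concrete, the sketch below shows one common way a triplane representation can be queried: a 3D point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three feature vectors are summed. This is an illustration of the general triplane idea only. The abstract does not describe how HexaGen3D decodes its latents, so the function names (`bilinear_sample`, `query_triplane`), the plane layout, and the sum aggregation are all assumptions rather than the authors' method.

```python
import numpy as np


def bilinear_sample(plane, u, v):
    """Bilinearly sample an (H, W, C) feature plane at (u, v) in [-1, 1].

    Assumes H >= 2 and W >= 2. u indexes columns, v indexes rows.
    """
    h, w, _ = plane.shape
    # Map normalized coordinates to continuous pixel coordinates.
    x = (u + 1.0) * 0.5 * (w - 1)
    y = (v + 1.0) * 0.5 * (h - 1)
    # Top-left corner of the enclosing cell, clamped so x0+1 and y0+1 stay valid.
    x0 = int(np.clip(np.floor(x), 0, w - 2))
    y0 = int(np.clip(np.floor(y), 0, h - 2))
    tx, ty = x - x0, y - y0
    top = (1 - tx) * plane[y0, x0] + tx * plane[y0, x0 + 1]
    bottom = (1 - tx) * plane[y0 + 1, x0] + tx * plane[y0 + 1, x0 + 1]
    return (1 - ty) * top + ty * bottom


def query_triplane(planes, point):
    """Return the feature vector for a 3D point in [-1, 1]^3.

    `planes` is a dict with hypothetical keys "xy", "xz", "yz", each an
    (H, W, C) array. Summation is one common aggregation choice.
    """
    x, y, z = point
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return f_xy + f_xz + f_yz


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    planes = {k: rng.normal(size=(8, 8, 4)) for k in ("xy", "xz", "yz")}
    print(query_triplane(planes, (0.2, -0.4, 0.7)).shape)
```

In a full pipeline, the resulting feature vector would typically be passed to a small decoder network that predicts quantities such as density or color; that part is omitted here because the abstract gives no details about it.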
