ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

Abstract

3D asset generation is attracting considerable attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods either use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior and learns to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net of the text-to-image model. Moreover, we design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its ability to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to existing methods, the results generated by our method are consistent and have favorable visual quality (-30% FID, -37% KID).
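To make the idea of cross-frame attention more concrete, the following is a minimal illustrative sketch in plain Python. It is not the authors' implementation: the abstract does not describe the layer's internals, so the function name `cross_frame_attention`, the absence of learned query/key/value projections, and the list-based tensor layout are all assumptions made for illustration. The sketch only shows the general principle that every token in one view attends to the tokens of all views, so that information is shared across frames within one denoising pass.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(frames):
    """Hypothetical cross-frame attention over a batch of views.

    frames: list of N views; each view is a list of tokens;
            each token is a list of floats of equal length.
    Assumption: queries, keys and values are the raw token features
    (a real layer would apply learned linear projections first).
    """
    # Pool keys/values from every view so each query can see all frames.
    pooled = [tok for frame in frames for tok in frame]
    dim = len(pooled[0])
    scale = 1.0 / math.sqrt(dim)

    output = []
    for frame in frames:
        new_frame = []
        for query in frame:
            # Scaled dot-product scores against tokens of ALL views.
            scores = [scale * sum(q * k for q, k in zip(query, key))
                      for key in pooled]
            weights = softmax(scores)
            # Weighted average of the pooled values.
            new_tok = [sum(w * val[d] for w, val in zip(weights, pooled))
                       for d in range(dim)]
            new_frame.append(new_tok)
        output.append(new_frame)
    return output
```

In this toy form, the output for each view has the same shape as its input, and each output token is a convex combination of tokens drawn from all views. A real layer inside a U-Net block would operate on batched feature maps and include learned projections and residual connections.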
