X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation

Abstract

Automatic text-to-3D content creation has recently made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize a 3D representation so that its rendered images align well with the given text, as evaluated by a pretrained 2D diffusion model. However, a substantial domain gap exists between 2D images and 3D assets, primarily because camera-related attributes vary across renderings and because 3D assets contain only foreground objects. Consequently, using 2D diffusion models directly to optimize 3D representations may lead to suboptimal results. To address this issue, we present X-Dreamer, a novel approach for high-quality text-to-3D content creation that effectively bridges the gap between text-to-2D and text-to-3D synthesis. X-Dreamer is built on two key designs: Camera-Guided Low-Rank Adaptation (CG-LoRA) and the Attention-Mask Alignment (AMA) loss. CG-LoRA dynamically incorporates camera information into the pretrained diffusion model by generating its trainable parameters in a camera-dependent manner. This integration improves the alignment between the generated 3D assets and the camera's perspective. The AMA loss guides the attention map of the pretrained diffusion model using the binary mask of the 3D object, prioritizing the creation of the foreground object. This loss ensures that the model focuses on generating accurate and detailed foreground objects. Extensive evaluations demonstrate the effectiveness of the proposed method compared to existing text-to-3D approaches. Project webpage: https://xmuxiaoma666.github.io/Projects/X-Dreamer
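
The abstract names the two components but does not give their exact form, so the following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes that CG-LoRA produces the two low-rank factors of a LoRA update from a camera vector through a linear map, and that the AMA loss is a binary cross-entropy between a min-max-normalized attention map and the rendered binary mask. The function names (`make_cg_lora`, `cg_lora_forward`, `ama_loss`), the linear generator, the normalization, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def make_cg_lora(d_in, d_out, d_cam, rank, rng):
    """Create hypothetical generator weights mapping a camera vector to LoRA factors."""
    return {
        # Generator for the down-projection A: small random initialization.
        "gen_a": rng.normal(0.0, 0.02, size=(d_cam, rank * d_in)),
        # Generator for the up-projection B: zeros, so the adapter starts as a
        # no-op on top of the frozen layer (the usual LoRA initialization).
        "gen_b": np.zeros((d_cam, d_out * rank)),
        "shape": (d_in, d_out, rank),
    }


def cg_lora_forward(x, w_frozen, camera, params):
    """Frozen linear layer plus a low-rank update whose factors depend on the camera.

    x: (n, d_in), w_frozen: (d_out, d_in), camera: (d_cam,)
    """
    d_in, d_out, rank = params["shape"]
    a = (camera @ params["gen_a"]).reshape(rank, d_in)   # camera-dependent A
    b = (camera @ params["gen_b"]).reshape(d_out, rank)  # camera-dependent B
    return x @ w_frozen.T + (x @ a.T) @ b.T


def ama_loss(attention, mask, eps=1e-6):
    """Binary cross-entropy between a normalized attention map and a binary mask."""
    span = attention.max() - attention.min()
    att = (attention - attention.min()) / (span + eps)   # min-max normalize to [0, 1]
    att = np.clip(att, eps, 1.0 - eps)                   # keep the logarithms finite
    bce = mask * np.log(att) + (1.0 - mask) * np.log(1.0 - att)
    return float(-np.mean(bce))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = make_cg_lora(d_in=4, d_out=3, d_cam=3, rank=2, rng=rng)
    x = rng.normal(size=(5, 4))            # five hypothetical feature tokens
    w = rng.normal(size=(3, 4))            # frozen pretrained weight
    camera = np.array([0.1, 0.5, -0.2])    # hypothetical camera parameters
    print(cg_lora_forward(x, w, camera, params).shape)

    mask = np.array([[1.0, 0.0], [0.0, 1.0]])       # rendered foreground mask
    attention = np.array([[0.9, 0.2], [0.1, 0.8]])  # hypothetical attention map
    print(ama_loss(attention, mask))
```

In this sketch the loss is small when high attention values coincide with the foreground mask and grows when attention falls on the background, which is the behavior the abstract attributes to the AMA loss.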
