Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model

Abstract

Recent advances in diffusion models such as ControlNet have enabled geometrically controllable, high-fidelity text-to-image generation. However, none of them addresses how to add such controllability to text-to-3D generation. In response, we propose Text2Control3D, a controllable text-to-3D avatar generation method in which the avatar's facial expression is controlled by a monocular video casually captured with a hand-held camera. Our main strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF), optimized with a set of controlled viewpoint-aware images that we generate from ControlNet, whose condition input is the depth map extracted from the input video. When generating the viewpoint-aware images, we use cross-reference attention to inject a well-controlled reference facial expression and appearance via cross attention. We also apply low-pass filtering to the Gaussian latent of the diffusion model to ameliorate the viewpoint-agnostic texture problem observed in our empirical analysis, in which the viewpoint-aware images contain identical textures at identical pixel positions, a pattern that is incoherent in 3D. Finally, to train NeRF on images that are viewpoint-aware yet not strictly consistent in geometry, our approach treats each image's geometric variation as a deformation from a shared 3D canonical space. Consequently, we construct the 3D avatar in the canonical space of a deformable NeRF by learning a set of per-image deformations via a deformation field table. We present empirical results and discuss the effectiveness of our method.
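The low-pass filtering step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which filter is used, so everything below is an assumption made for illustration: a single-channel 32x32 latent, a separable Gaussian blur with wrap-around borders, and a rescaling back to zero mean and unit variance so the result still resembles a standard Gaussian sample. The function names (`low_pass_latent`, `roughness`) and all parameter values are hypothetical and are not taken from the paper.

```python
import math
import random


def gaussian_kernel(sigma, radius):
    """Return a normalized 1D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(grid, kernel):
    """Convolve every row with the kernel, wrapping around at the borders."""
    radius = len(kernel) // 2
    width = len(grid[0])
    blurred = []
    for row in grid:
        blurred.append([
            sum(k * row[(x + j - radius) % width] for j, k in enumerate(kernel))
            for x in range(width)
        ])
    return blurred


def transpose(grid):
    """Swap rows and columns of a 2D list."""
    return [list(col) for col in zip(*grid)]


def low_pass_latent(latent, sigma=1.0, radius=3):
    """Blur a 2D latent (assumed filter), then rescale to zero mean, unit variance."""
    kernel = gaussian_kernel(sigma, radius)
    # Separable blur: filter along rows, then along columns.
    blurred = transpose(blur_rows(transpose(blur_rows(latent, kernel)), kernel))
    values = [v for row in blurred for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [[(v - mean) / std for v in row] for row in blurred]


def roughness(grid):
    """Mean squared difference of horizontal neighbours (proxy for high-frequency energy)."""
    diffs = [(row[x + 1] - row[x]) ** 2
             for row in grid for x in range(len(row) - 1)]
    return sum(diffs) / len(diffs)


# Hypothetical usage: sample a Gaussian latent and suppress its high frequencies.
rng = random.Random(0)
latent = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(32)]
filtered = low_pass_latent(latent, sigma=1.0, radius=3)
print(roughness(latent), roughness(filtered))
```

The filtered latent keeps the same shape and overall scale as the input but varies more smoothly between neighbouring positions. In a real pipeline the latent would have several channels and the filter might be applied in the frequency domain instead. The rescaling step is a design choice of this sketch and is not something the abstract states.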
