Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable Text-to-3D Generation

Abstract

Text-to-3D generation has recently attracted significant attention, fueled by 2D diffusion models trained on billions of image-text pairs. Existing methods primarily rely on score distillation, which uses 2D diffusion priors to supervise the generation of 3D models such as NeRF. However, score distillation is prone to view inconsistency, and implicit NeRF modeling can produce arbitrary shapes, which leads to less realistic and uncontrollable 3D generation. In this work, we propose Points-to-3D, a flexible framework that bridges the gap between sparse yet freely available 3D points and realistic, shape-controllable 3D generation by distilling knowledge from both 2D and 3D diffusion models. The core idea of Points-to-3D is to introduce controllable sparse 3D points to guide text-to-3D generation. Specifically, we use a sparse point cloud generated by the 3D diffusion model Point-E, conditioned on a single reference image, as the geometric prior. To better utilize the sparse 3D points, we propose an efficient point cloud guidance loss that adaptively drives the NeRF's geometry to align with the shape of the sparse 3D points. In addition to controlling the geometry, we optimize the NeRF for a more view-consistent appearance. Specifically, we perform score distillation with the publicly available 2D image diffusion model ControlNet, conditioned on the text as well as the depth map of the learned compact geometry. Qualitative and quantitative comparisons demonstrate that Points-to-3D improves view consistency and achieves good shape controllability for text-to-3D generation. Points-to-3D provides users with a new way to improve and control text-to-3D generation.
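To make the idea of a point cloud guidance loss more concrete, the following is a minimal sketch and not the formulation used in Points-to-3D, which the abstract does not specify. It assumes a simple two-term objective: the density field should be occupied at the guidance points and empty at locations far from all of them. The names point_guidance_loss, occupancy and ball_density, the parameters step, margin, bound and num_far, and the toy point cloud (sampled inside a ball rather than produced by Point-E) are all hypothetical, and a hard-edged ball stands in for a NeRF density network.

```python
import numpy as np


def occupancy(density, step=0.1):
    """Map non-negative volume density to an occupancy value in [0, 1)."""
    return 1.0 - np.exp(-step * density)


def point_guidance_loss(density_fn, cloud, num_far=256, margin=0.2,
                        bound=1.0, seed=0):
    """Assumed loss: high occupancy at cloud points, low occupancy far away."""
    rng = np.random.default_rng(seed)
    eps = 1e-6

    # Term 1: the field should be occupied at the sparse guidance points.
    occ_near = occupancy(density_fn(cloud))
    loss_near = -np.log(occ_near + eps).mean()

    # Term 2: draw random points in the scene box and keep those farther
    # than `margin` from every cloud point; the field should be empty there.
    samples = rng.uniform(-bound, bound, size=(num_far, 3))
    dists = np.linalg.norm(samples[:, None, :] - cloud[None, :, :], axis=-1)
    far = samples[dists.min(axis=1) > margin]
    if len(far) == 0:
        return loss_near
    occ_far = occupancy(density_fn(far))
    loss_far = -np.log(1.0 - occ_far + eps).mean()
    return loss_near + loss_far


def ball_density(center, radius, peak=50.0):
    """Toy stand-in for a NeRF density network: a hard-edged solid ball."""
    def density_fn(points):
        d = np.linalg.norm(points - center, axis=-1)
        return peak * (d <= radius)
    return density_fn


# Hypothetical guidance cloud: 512 points sampled uniformly inside a ball
# of radius 0.45 centered at the origin.
rng = np.random.default_rng(1)
directions = rng.normal(size=(512, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
cloud = directions * 0.45 * rng.uniform(size=(512, 1)) ** (1 / 3)

# A field that matches the cloud versus one placed elsewhere in the scene.
aligned = point_guidance_loss(ball_density(np.zeros(3), 0.5), cloud)
shifted = point_guidance_loss(ball_density(np.full(3, 0.6), 0.5), cloud)
print(f"aligned field loss: {aligned:.4f}")
print(f"shifted field loss: {shifted:.4f}")
```

In this toy setup the field that coincides with the cloud receives a much lower loss than the shifted one, which is the behavior a geometry-guidance term is meant to reward. An actual training loop would evaluate a differentiable version of such a term on the NeRF's density output and combine it with the score distillation objective.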
