Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model
September 7, 2023
Authors: Sungwon Hwang, Junha Hyung, Jaegul Choo
cs.AI
Abstract
Recent advances in diffusion models such as ControlNet have enabled
geometrically controllable, high-fidelity text-to-image generation. However,
none of them addresses the question of adding such controllability to
text-to-3D generation. In response, we propose Text2Control3D, a controllable
text-to-3D avatar generation method whose facial expression is controllable
given a monocular video casually captured with a hand-held camera. Our main
strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF)
optimized with a set of controlled viewpoint-aware images that we generate from
ControlNet, whose condition input is the depth map extracted from the input
video. When generating the viewpoint-aware images, we utilize cross-reference
attention to inject well-controlled, referential facial expression and
appearance via cross attention. We also apply low-pass filtering to the Gaussian
latent of the diffusion model in order to ameliorate the viewpoint-agnostic
texture problem we observed from our empirical analysis, where the
viewpoint-aware images contain identical textures on identical pixel positions
that are incomprehensible in 3D. Finally, to train NeRF with the images that
are viewpoint-aware yet not strictly consistent in geometry, our approach
treats per-image geometric variation as a deformation from a shared
3D canonical space. Consequently, we construct the 3D avatar in a canonical
space of a deformable NeRF by learning a set of per-image deformations via a
deformation field table. We present empirical results and discuss the
effectiveness of our method.
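
The low-pass filtering step lends itself to a short illustration. Below is a minimal sketch, not the authors' released code, of how one might low-pass filter the initial Gaussian latent of a latent diffusion model before running depth-conditioned ControlNet sampling; the FFT-based circular mask, the `low_pass_filter_latent` function, the `cutoff` hyperparameter, and the 1x4x64x64 latent shape are illustrative assumptions, since the abstract only states that the Gaussian latent is low-pass filtered to suppress viewpoint-agnostic texture.

```python
# Minimal sketch (assumed implementation, not the authors' code) of low-pass
# filtering the initial Gaussian latent z_T before diffusion sampling.
import torch

def low_pass_filter_latent(latent: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Attenuate high spatial frequencies of a (B, C, H, W) Gaussian latent.

    cutoff is the fraction of the normalized frequency range that is kept;
    it is an assumed hyperparameter, not a value reported in the paper.
    """
    _, _, h, w = latent.shape
    # 2D FFT over the spatial dimensions, shifted so DC sits at the center.
    freq = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))

    # Centered circular low-pass mask over normalized coordinates in [-1, 1].
    yy = torch.linspace(-1.0, 1.0, h, device=latent.device).view(h, 1)
    xx = torch.linspace(-1.0, 1.0, w, device=latent.device).view(1, w)
    mask = ((yy**2 + xx**2).sqrt() <= cutoff).to(latent.dtype)

    # Zero out high frequencies and transform back to the spatial domain.
    filtered = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)))
    return filtered.real

# Example usage: filter the starting noise before handing it to the
# depth-conditioned ControlNet sampler (latent shape assumes a 512x512
# Stable Diffusion model).
z_T = torch.randn(1, 4, 64, 64)
z_T_filtered = low_pass_filter_latent(z_T, cutoff=0.25)
```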