Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model
September 7, 2023
Authors: Sungwon Hwang, Junha Hyung, Jaegul Choo
cs.AI
Abstract
Recent advances in diffusion models such as ControlNet have enabled
geometrically controllable, high-fidelity text-to-image generation. However,
none of them addresses the question of adding such controllability to
text-to-3D generation. In response, we propose Text2Control3D, a controllable
text-to-3D avatar generation method whose facial expression is controllable
given a monocular video casually captured with a hand-held camera. Our main
strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF)
optimized with a set of controlled viewpoint-aware images that we generate from
ControlNet, whose condition input is the depth map extracted from the input
video. When generating the viewpoint-aware images, we utilize cross-reference
attention to inject well-controlled, referential facial expression and
appearance via cross attention. We also apply low-pass filtering to the Gaussian
latent of the diffusion model in order to ameliorate the viewpoint-agnostic
texture problem we observed from our empirical analysis, where the
viewpoint-aware images contain identical textures on identical pixel positions
that are incomprehensible in 3D. Finally, to train NeRF with the images that
are viewpoint-aware yet not strictly consistent in geometry, our approach
treats per-image geometric variation as a deformation from a shared
3D canonical space. Consequently, we construct the 3D avatar in a canonical
space of a deformable NeRF by learning a set of per-image deformations via a
deformation field table. We present empirical results and discuss the
effectiveness of our method.
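
The low-pass filtering step lends itself to a short illustration. Below is a minimal sketch, not the authors' released code, of how one might low-pass filter the initial Gaussian latent of a latent diffusion model before running depth-conditioned ControlNet sampling; the FFT-based circular mask, the `low_pass_filter_latent` function, the `cutoff` hyperparameter, and the 1x4x64x64 latent shape are illustrative assumptions, since the abstract only states that the Gaussian latent is low-pass filtered to suppress viewpoint-agnostic texture.

```python
# Minimal sketch (assumed implementation, not the authors' code) of low-pass
# filtering the initial Gaussian latent z_T before diffusion sampling.
import torch

def low_pass_filter_latent(latent: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Attenuate high spatial frequencies of a (B, C, H, W) Gaussian latent.

    cutoff is the fraction of the normalized frequency range that is kept;
    it is an assumed hyperparameter, not a value reported in the paper.
    """
    _, _, h, w = latent.shape
    # 2D FFT over the spatial dimensions, shifted so DC sits at the center.
    freq = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))

    # Centered circular low-pass mask over normalized coordinates in [-1, 1].
    yy = torch.linspace(-1.0, 1.0, h, device=latent.device).view(h, 1)
    xx = torch.linspace(-1.0, 1.0, w, device=latent.device).view(1, w)
    mask = ((yy**2 + xx**2).sqrt() <= cutoff).to(latent.dtype)

    # Zero out high frequencies and transform back to the spatial domain.
    filtered = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)))
    return filtered.real

# Example usage: filter the starting noise before handing it to the
# depth-conditioned ControlNet sampler (latent shape assumes a 512x512
# Stable Diffusion model).
z_T = torch.randn(1, 4, 64, 64)
z_T_filtered = low_pass_filter_latent(z_T, cutoff=0.25)
```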