FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields

Abstract

As recent advances in Neural Radiance Fields (NeRF) have enabled high-fidelity 3D face reconstruction and novel view synthesis, manipulating such reconstructions has become an essential task in 3D vision. However, existing manipulation methods require extensive human labor, such as user-provided semantic masks and manual attribute search, which makes them unsuitable for non-expert users. Instead, our approach requires only a single text input to manipulate a face reconstructed with NeRF. To do so, we first train a scene manipulator, a latent code-conditional deformable NeRF, over a dynamic scene so that face deformation can be controlled through the latent code. However, representing a scene deformation with a single latent code is unfavorable for compositing local deformations observed in different instances. Therefore, our proposed Position-conditional Anchor Compositor (PAC) learns to represent a manipulated scene with spatially varying latent codes. The renderings that these codes produce with the scene manipulator are then optimized to yield high cosine similarity to a target text in CLIP embedding space, which enables text-driven manipulation. To the best of our knowledge, our approach is the first to address text-driven manipulation of a face reconstructed with NeRF. Extensive results, comparisons, and ablation studies demonstrate the effectiveness of our approach.
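
The two ideas named in the abstract, spatially varying latent codes and a cosine-similarity objective in an embedding space, can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the latent code at each 3D point is a softmax-weighted blend of a few anchor codes, which the abstract does not spell out, and it uses plain lists in place of real CLIP embeddings. The names `compose_latent`, `cosine_similarity`, and `clip_style_loss` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compose_latent(weight_logits, anchors):
    """Blend anchor latent codes into one latent code for a single 3D point.

    Assumption: in a full system the logits would come from a network
    conditioned on the point's position; here they are passed in directly.
    """
    weights = softmax(weight_logits)
    dim = len(anchors[0])
    return [sum(w * a[d] for w, a in zip(weights, anchors)) for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_style_loss(image_embedding, text_embedding):
    """Loss that decreases as the two embeddings align (assumed form: 1 - cosine)."""
    return 1.0 - cosine_similarity(image_embedding, text_embedding)
```

With equal logits, `compose_latent` returns the average of the anchors, and different logits at different points give different codes across space. Minimizing `clip_style_loss` between the embedding of a rendered image and the embedding of the target text corresponds to maximizing their cosine similarity.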
