FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields

Abstract

Recent advances in Neural Radiance Fields (NeRF) have enabled high-fidelity 3D face reconstruction and novel view synthesis, and manipulating the reconstructed faces has become an essential task in 3D vision. However, existing manipulation methods require extensive human labor, such as user-provided semantic masks and manual attribute search, which makes them unsuitable for non-expert users. Our approach instead requires only a single text prompt to manipulate a face reconstructed with NeRF. To do so, we first train a scene manipulator, a latent code-conditional deformable NeRF, over a dynamic scene so that face deformation can be controlled through the latent code. However, representing a scene deformation with a single latent code is ill-suited to compositing local deformations observed in different instances. We therefore propose the Position-conditional Anchor Compositor (PAC), which learns to represent a manipulated scene with spatially varying latent codes. Renderings produced from these codes by the scene manipulator are then optimized to yield high cosine similarity to a target text in the CLIP embedding space, which enables text-driven manipulation. To the best of our knowledge, our approach is the first to address text-driven manipulation of a face reconstructed with NeRF. Extensive results, comparisons, and ablation studies demonstrate the effectiveness of our approach.
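
The two ideas named in the abstract, spatially varying latent codes and a cosine-similarity objective, can be illustrated with a minimal NumPy sketch. The abstract does not say how PAC is parameterized, so everything here is an assumption made for illustration: the softmax-weighted blend of anchor codes, the linear map standing in for a learned network, the names `compose_latents` and `cosine_similarity`, and the random vectors used in place of real CLIP embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compose_latents(points, anchors, weight, bias):
    """Blend K anchor codes into one latent code per 3D point (hypothetical).

    points:  (N, 3) sample positions
    anchors: (K, D) anchor latent codes
    weight:  (3, K) and bias: (K,) form a linear stand-in for a learned network
    """
    logits = points @ weight + bias      # (N, K) position-dependent scores
    blend = softmax(logits, axis=-1)     # each row is non-negative and sums to 1
    return blend @ anchors               # (N, D) spatially varying codes

def cosine_similarity(a, b, eps=1e-8):
    # Cosine of the angle between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))
anchors = rng.normal(size=(4, 8))
weight = rng.normal(size=(3, 4))
bias = np.zeros(4)
codes = compose_latents(points, anchors, weight, bias)

# Random placeholders; a real pipeline would use CLIP image and text embeddings.
image_emb = rng.normal(size=16)
text_emb = rng.normal(size=16)
loss = 1.0 - cosine_similarity(image_emb, text_emb)  # lower means more similar
```

Because the blend weights form a convex combination, each per-point code stays within the range spanned by the anchor codes, which is one plausible way to let different regions of the face follow different observed deformations.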
