FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields

Abstract

Recent advances in Neural Radiance Fields (NeRF) have enabled high-fidelity 3D face reconstruction and novel view synthesis, making the manipulation of such reconstructions an essential task in 3D vision. However, existing manipulation methods require extensive human labor, such as user-provided semantic masks and manual attribute search, which makes them unsuitable for non-expert users. In contrast, our approach is designed to require only a single text to manipulate a face reconstructed with NeRF. To do so, we first train a scene manipulator, a latent code-conditional deformable NeRF, over a dynamic scene so that face deformation can be controlled through the latent code. However, representing a scene deformation with a single latent code is unfavorable for compositing local deformations observed in different instances. Therefore, our proposed Position-conditional Anchor Compositor (PAC) learns to represent a manipulated scene with spatially varying latent codes. The scenes rendered from these codes by the scene manipulator are then optimized to yield high cosine similarity to a target text in CLIP embedding space, enabling text-driven manipulation. To the best of our knowledge, our approach is the first to address text-driven manipulation of a face reconstructed with NeRF. Extensive results, comparisons, and ablation studies demonstrate the effectiveness of our approach.
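The two ideas named in the abstract, spatially varying latent codes and a cosine-similarity objective in an embedding space, can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the linear scoring layer (`w`, `b`) standing in for a learned network, and the softmax blending of a small set of anchor codes. In practice the embeddings would come from CLIP's image and text encoders, which are not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compose_latent_codes(positions, anchors, w, b):
    """Blend K anchor latent codes into one code per 3D position.

    positions: (N, 3) sample points; anchors: (K, D) latent codes.
    w: (3, K), b: (K,) are a hypothetical linear stand-in for a learned
    position-conditional network (an assumption, not the paper's design).
    Returns an (N, D) array of spatially varying latent codes.
    """
    weights = softmax(positions @ w + b, axis=-1)  # (N, K), each row sums to 1
    return weights @ anchors

def cosine_similarity(a, b, eps=1e-8):
    # Cosine of the angle between two embedding vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def text_similarity_loss(image_embedding, text_embedding):
    # Minimizing this loss maximizes cosine similarity to the target text.
    return 1.0 - cosine_similarity(image_embedding, text_embedding)
```

With these definitions, an optimizer would adjust the compositor's parameters so that the embedding of the rendered image moves toward the embedding of the target text. The rendering step itself and the gradient computation are outside the scope of this sketch.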
