FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

Abstract

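Identity-preserving text-to-video generation (IPT2V) lets users produce diverse and imaginative videos while keeping a human facial identity consistent. Despite recent progress, existing methods often suffer significant identity distortion under large facial pose variations or facial occlusions. In this paper, we propose FaithfulFaces, a pose-faithful facial identity preservation learning framework that improves IPT2V in complex dynamic scenes. The core of FaithfulFaces is a pose-shared identity aligner that refines and aligns facial poses across distinct views via a pose-shared dictionary and a pose variation-identity invariance constraint. By mapping single-view inputs into a global facial pose representation with explicit Euler angle embeddings, FaithfulFaces provides a pose-faithful facial prior that guides generative foundations toward robust identity-preserving generation. In addition, we develop a specialized pipeline to curate a high-quality video dataset with substantial facial pose diversity. Extensive experiments demonstrate that FaithfulFaces achieves state-of-the-art performance, maintaining superior identity consistency and structural clarity even under pose changes and occlusions.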


