FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation
May 6, 2026
Authors: Yuanzhi Wang, Xuhua Ren, Jiaxiang Cheng, Bing Ma, Kai Yu, Sen Liang, Wenyue Li, Tianxiang Zheng, Qinglin Lu, Zhen Cui
cs.AI
Abstract
Identity-preserving text-to-video generation (IPT2V) empowers users to produce diverse and imaginative videos with a consistent human facial identity. Despite recent progress, existing methods often suffer from significant identity distortion under large facial pose variations or facial occlusions. In this paper, we propose FaithfulFaces, a pose-faithful facial identity preservation learning framework that improves IPT2V in complex dynamic scenes. At the core of FaithfulFaces is a pose-shared identity aligner that refines and aligns facial poses across distinct views via a pose-shared dictionary and a pose variation-identity invariance constraint. By mapping single-view inputs into a global facial pose representation with explicit Euler angle embeddings, FaithfulFaces provides a pose-faithful facial prior that guides the generative foundation model toward robust identity-preserving generation. In particular, we develop a specialized pipeline to curate a high-quality video dataset featuring substantial facial pose diversity. Extensive experiments demonstrate that FaithfulFaces achieves state-of-the-art performance, maintaining superior identity consistency and structural clarity even under pose changes and occlusions.
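To make the abstract's ingredients concrete, the following is a minimal, purely illustrative sketch of the kind of computation it describes: a sinusoidal embedding of head-pose Euler angles, a soft lookup into a pose-shared dictionary that refines an identity feature, and the pose variation-identity invariance constraint. All function names, shapes, and the specific embedding/attention choices here are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def euler_angle_embedding(yaw, pitch, roll, dim=16):
    """Sinusoidal embedding of head-pose Euler angles (radians).
    Hypothetical design: the abstract only specifies 'explicit Euler
    angle embeddings', not their exact form."""
    angles = np.array([yaw, pitch, roll])                 # (3,)
    freqs = 2.0 ** np.arange(dim // 2)                    # (dim/2,) geometric frequencies
    args = angles[:, None] * freqs[None, :]               # (3, dim/2)
    emb = np.concatenate([np.sin(args), np.cos(args)], axis=1)  # (3, dim)
    return emb.reshape(-1)                                # flattened (3*dim,)

def pose_shared_align(id_feat, pose_emb, dictionary):
    """Soft lookup into a pose-shared dictionary: attention weights derived
    from the pose embedding select dictionary atoms whose value parts refine
    the identity feature. Shapes and residual form are illustrative."""
    p = pose_emb.size
    keys = dictionary[:, :p]                              # (K, p) pose keys
    logits = keys @ pose_emb / np.sqrt(p)                 # (K,) scaled similarities
    w = np.exp(logits - logits.max())
    w /= w.sum()                                          # softmax over K atoms
    refinement = w @ dictionary[:, p:]                    # (d,) weighted value parts
    return id_feat + refinement                           # residual refinement

def pose_invariance_loss(feat_a, feat_b):
    """Pose variation-identity invariance constraint (sketch): aligned
    features of the same identity under different poses should coincide."""
    return float(np.mean((feat_a - feat_b) ** 2))
```

In this reading, the dictionary is shared across poses, so two views of the same face attend to the same atoms, and the invariance loss penalizes any remaining pose-dependent drift in the aligned identity feature.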