FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

Abstract

Identity-preserving text-to-video generation (IPT2V) lets users produce diverse and imaginative videos with a consistent human facial identity. Despite recent progress, existing methods often suffer from significant identity distortion under large facial pose variations or facial occlusions. In this paper, we propose FaithfulFaces, a pose-faithful facial identity preservation learning framework that improves IPT2V in complex dynamic scenes. At the core of FaithfulFaces is a pose-shared identity aligner that refines and aligns facial poses across distinct views via a pose-shared dictionary and a pose variation-identity invariance constraint. By mapping single-view inputs into a global facial pose representation with explicit Euler angle embeddings, FaithfulFaces provides a pose-faithful facial prior that guides generative foundation models toward robust identity-preserving generation. We also develop a specialized pipeline to curate a high-quality video dataset featuring substantial facial pose diversity. Extensive experiments demonstrate that FaithfulFaces achieves state-of-the-art performance, maintaining superior identity consistency and structural clarity even under pose changes and occlusions.
