RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

Abstract

Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have made remarkable progress in audio-visual synchronization, but a significant gap remains between current results and practical applications. The challenges are two-fold: 1) preserving unique individual traits to achieve high-precision lip synchronization, and 2) generating high-quality facial renderings in real time. In this paper, we propose RealTalk, a novel generalized audio-driven framework that consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. By incorporating cross-modal attention on the enriched facial priors, we effectively align lip movements with audio and thus attain greater precision in expression prediction. In the second component, we design a lightweight facial identity alignment (FIA) module, which includes a lip-shape control structure and a face texture reference structure. This design allows us to generate fine details in real time without depending on sophisticated and inefficient feature alignment modules. Quantitative and qualitative experimental results on public datasets demonstrate the clear advantages of our method in terms of lip-speech synchronization and generation quality. Furthermore, our method is efficient and requires fewer computational resources, making it well suited to practical applications.
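
As a rough illustration of the cross-modal attention idea mentioned in the abstract, the sketch below shows plain scaled dot-product attention in which per-frame audio features act as queries over a set of facial-prior tokens. It is not the RealTalk implementation: the abstract does not specify the architecture, so the query/key/value assignment, the function names (`softmax`, `cross_attention`), and the toy data are all assumptions made for illustration, and learned projections and multi-head structure are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    queries: T vectors of length d (assumed here: per-frame audio features)
    keys:    N vectors of length d (assumed here: facial-prior tokens)
    values:  N vectors of length d_v
    Returns (outputs, weights): T fused vectors and the T x N attention weights.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        fused = [sum(wj * v[c] for wj, v in zip(w, values))
                 for c in range(len(values[0]))]
        outputs.append(fused)
        weights.append(w)
    return outputs, weights

# Hypothetical toy data: 2 audio frames, 3 facial-prior tokens, feature size 2.
audio_feats = [[1.0, 0.0], [0.0, 1.0]]
prior_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prior_values = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

fused, attn = cross_attention(audio_feats, prior_keys, prior_values)
```

Each row of `attn` sums to one, so every fused vector is a convex combination of the prior values weighted by how well the corresponding audio frame matches each prior token.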
