Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics

Abstract

Recent advances in speech-driven 3D talking head generation have substantially improved lip synchronization. However, existing models still struggle to capture the perceptual alignment between varying speech characteristics and the corresponding lip movements. In this work, we argue that three criteria (Temporal Synchronization, Lip Readability, and Expressiveness) are crucial for achieving perceptually accurate lip movements. Motivated by our hypothesis that a desirable representation space exists that meets these three criteria, we introduce a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes. We find that our learned representation exhibits desirable characteristics, and we plug it into existing models as a perceptual loss to better align lip movements with the given speech. In addition, we use this representation as a perceptual metric and introduce two other physically grounded lip synchronization metrics to assess how well generated 3D talking heads align with these three criteria. Experiments show that training 3D talking head generation models with our perceptual loss significantly improves all three aspects of perceptually accurate lip synchronization. Code and datasets are available at https://perceptual-3d-talking-head.github.io/.
