FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation

Abstract

Recent advances in audio-driven portrait animation have demonstrated impressive capabilities. However, existing methods struggle to align with fine-grained human preferences across multiple dimensions, such as motion naturalness, lip-sync accuracy, and visual quality. This stems from two obstacles: preference objectives often conflict with one another, which makes them difficult to optimize jointly, and large-scale, high-quality datasets with multidimensional preference annotations are scarce. To address these challenges, we first introduce Talking-Critic, a multimodal reward model that learns human-aligned reward functions to quantify how well generated videos satisfy multidimensional expectations. Leveraging this model, we curate Talking-NSQ, a large-scale multidimensional human preference dataset containing 410K preference pairs. Finally, we propose Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), a novel framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences. TLPO decouples preferences into specialized expert modules, which are then fused across timesteps and network layers, enabling comprehensive, fine-grained enhancement across all dimensions without mutual interference. Experiments demonstrate that Talking-Critic significantly outperforms existing methods in aligning with human preference ratings. Meanwhile, TLPO achieves substantial improvements over baseline models in lip-sync accuracy, motion naturalness, and visual quality, exhibiting superior performance in both qualitative and quantitative evaluations. Project page: https://fantasy-amap.github.io/fantasy-talking2/
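To make the idea of fusing expert modules across timesteps and network layers more concrete, the following is a minimal illustrative sketch, not the implementation used in this work. The abstract does not specify the fusion mechanism, so everything here is an assumption: the three expert names simply mirror the three preference dimensions, and `gate_table`, `gate_weights`, and `fuse` are hypothetical names. Each expert is assumed to contribute a residual to a layer's output, and a softmax gate that depends on the layer index and the normalized timestep decides how the residuals are mixed. All numeric values are arbitrary toy values.

```python
import math

# Hypothetical expert names, mirroring the three preference dimensions in the abstract.
EXPERTS = ("motion_naturalness", "lip_sync", "visual_quality")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(gate_table, layer, timestep, num_timesteps):
    """Assumed gate: each expert has per-layer logits of the form a + b * t_norm."""
    t_norm = timestep / (num_timesteps - 1)  # 0.0 at the last denoising step, 1.0 at the noisiest
    logits = [a + b * t_norm for (a, b) in gate_table[layer]]
    return softmax(logits)


def fuse(base_out, expert_deltas, weights):
    """Add the gate-weighted sum of expert residuals to the base layer output."""
    fused = list(base_out)
    for w, delta in zip(weights, expert_deltas):
        for i, d in enumerate(delta):
            fused[i] += w * d
    return fused


# Toy gate parameters (a, b) per layer and expert; arbitrary values for illustration only.
gate_table = {
    0: [(0.0, 2.0), (0.0, 0.0), (0.0, -2.0)],
    1: [(0.0, 0.0), (1.0, 0.0), (0.0, 0.0)],
}
base_out = [1.0, 1.0]                                   # toy output of one base layer
expert_deltas = [[0.3, 0.0], [0.0, 0.3], [-0.3, -0.3]]  # toy residual from each expert

w_noisy = gate_weights(gate_table, layer=0, timestep=999, num_timesteps=1000)
w_clean = gate_weights(gate_table, layer=0, timestep=0, num_timesteps=1000)
fused_noisy = fuse(base_out, expert_deltas, w_noisy)
fused_clean = fuse(base_out, expert_deltas, w_clean)
```

In this toy setup the same layer mixes the experts differently at different timesteps, while layer 1 uses a timestep-independent mix. In a trained model the gate parameters would presumably be learned rather than fixed by hand.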
