FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation

August 15, 2025
Authors: MengChao Wang, Qiang Wang, Fan Jiang, Mu Xu
cs.AI

Abstract

Recent advances in audio-driven portrait animation have demonstrated impressive capabilities. However, existing methods struggle to align with fine-grained human preferences across multiple dimensions, such as motion naturalness, lip-sync accuracy, and visual quality. This stems from the difficulty of optimizing among competing, often conflicting preference objectives and from the scarcity of large-scale, high-quality datasets with multidimensional preference annotations. To address these challenges, we first introduce Talking-Critic, a multimodal reward model that learns human-aligned reward functions to quantify how well generated videos satisfy multidimensional expectations. Leveraging this model, we curate Talking-NSQ, a large-scale multidimensional human preference dataset containing 410K preference pairs. Finally, we propose Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), a novel framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences. TLPO decouples preferences into specialized expert modules, which are then fused across timesteps and network layers, enabling comprehensive, fine-grained enhancement across all dimensions without mutual interference. Experiments demonstrate that Talking-Critic significantly outperforms existing methods in aligning with human preference ratings. Meanwhile, TLPO achieves substantial improvements over baseline models in lip-sync accuracy, motion naturalness, and visual quality, exhibiting superior performance in both qualitative and quantitative evaluations. Our project page: https://fantasy-amap.github.io/fantasy-talking2/
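
The abstract describes TLPO as decoupling preferences into specialized expert modules whose contributions are fused across diffusion timesteps and network layers. The sketch below is a minimal, speculative PyTorch illustration of what such timestep-layer adaptive multi-expert fusion could look like; the class names, the LoRA-style expert design, the gating MLP, and all dimensions are assumptions made for illustration, not the paper's implementation.

```python
# Speculative sketch of timestep-layer adaptive multi-expert fusion.
# Assumption: each preference dimension (e.g., lip-sync, motion, visual quality)
# gets a lightweight low-rank expert, and a small gate conditioned on the
# diffusion timestep and the layer index weights the experts per layer.
import math
import torch
import torch.nn as nn


class PreferenceExpert(nn.Module):
    """Low-rank residual expert attached to one backbone layer (LoRA-style assumption)."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start with no effect on the backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class TimestepLayerGate(nn.Module):
    """Predicts per-expert fusion weights from the timestep and the layer index."""

    def __init__(self, num_experts: int, num_layers: int, t_dim: int = 64):
        super().__init__()
        self.t_dim = t_dim
        self.layer_emb = nn.Embedding(num_layers, t_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * t_dim, t_dim), nn.SiLU(), nn.Linear(t_dim, num_experts)
        )

    def timestep_embedding(self, t: torch.Tensor) -> torch.Tensor:
        # standard sinusoidal embedding of the diffusion timestep
        half = self.t_dim // 2
        freqs = torch.exp(-torch.arange(half, device=t.device) * math.log(10000.0) / half)
        args = t.float()[:, None] * freqs[None, :]
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

    def forward(self, t: torch.Tensor, layer_idx: int) -> torch.Tensor:
        layer_vec = self.layer_emb(torch.full_like(t, layer_idx))
        gate_in = torch.cat([self.timestep_embedding(t), layer_vec], dim=-1)
        return torch.softmax(self.mlp(gate_in), dim=-1)  # (batch, num_experts)


class MultiExpertLayer(nn.Module):
    """Backbone hidden states plus a gated sum of per-preference expert residuals."""

    def __init__(self, dim: int, num_experts: int, num_layers: int, layer_idx: int):
        super().__init__()
        self.experts = nn.ModuleList(PreferenceExpert(dim) for _ in range(num_experts))
        self.gate = TimestepLayerGate(num_experts, num_layers)
        self.layer_idx = layer_idx

    def forward(self, h: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        w = self.gate(t, self.layer_idx)  # (batch, num_experts)
        residual = sum(
            w[:, i, None, None] * expert(h) for i, expert in enumerate(self.experts)
        )
        return h + residual


if __name__ == "__main__":
    # toy usage: 3 experts (lip-sync, motion, visual quality), hidden states (B, N, D)
    layer = MultiExpertLayer(dim=128, num_experts=3, num_layers=12, layer_idx=4)
    h = torch.randn(2, 16, 128)
    t = torch.randint(0, 1000, (2,))
    print(layer(h, t).shape)  # torch.Size([2, 16, 128])
```

Because the gate depends on both the timestep and the layer index, each preference expert can contribute more strongly where it matters (e.g., at timesteps or layers most relevant to its dimension) rather than competing uniformly everywhere, which matches the abstract's claim of fine-grained enhancement without mutual interference.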