FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation

August 15, 2025
Authors: MengChao Wang, Qiang Wang, Fan Jiang, Mu Xu
cs.AI

Abstract

Recent advances in audio-driven portrait animation have demonstrated impressive capabilities. However, existing methods struggle to align with fine-grained human preferences across multiple dimensions, such as motion naturalness, lip-sync accuracy, and visual quality. This stems from the difficulty of optimizing among competing, often conflicting preference objectives and from the scarcity of large-scale, high-quality datasets with multidimensional preference annotations. To address these challenges, we first introduce Talking-Critic, a multimodal reward model that learns human-aligned reward functions to quantify how well generated videos satisfy multidimensional expectations. Leveraging this model, we curate Talking-NSQ, a large-scale multidimensional human preference dataset containing 410K preference pairs. Finally, we propose Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), a novel framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences. TLPO decouples preferences into specialized expert modules, which are then fused across timesteps and network layers, enabling comprehensive, fine-grained enhancement across all dimensions without mutual interference. Experiments demonstrate that Talking-Critic significantly outperforms existing methods in aligning with human preference ratings. Meanwhile, TLPO achieves substantial improvements over baseline models in lip-sync accuracy, motion naturalness, and visual quality, exhibiting superior performance in both qualitative and quantitative evaluations. Our project page: https://fantasy-amap.github.io/fantasy-talking2/
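
The abstract describes TLPO as decoupling preferences into specialized expert modules whose contributions are fused across diffusion timesteps and network layers. The sketch below is a minimal, speculative PyTorch illustration of what such timestep-layer adaptive multi-expert fusion could look like; the class names, the LoRA-style expert design, the gating MLP, and all dimensions are assumptions made for illustration, not the paper's implementation.

```python
# Speculative sketch of timestep-layer adaptive multi-expert fusion.
# Assumption: each preference dimension (e.g., lip-sync, motion, visual quality)
# gets a lightweight low-rank expert, and a small gate conditioned on the
# diffusion timestep and the layer index weights the experts per layer.
import math
import torch
import torch.nn as nn


class PreferenceExpert(nn.Module):
    """Low-rank residual expert attached to one backbone layer (LoRA-style assumption)."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start with no effect on the backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class TimestepLayerGate(nn.Module):
    """Predicts per-expert fusion weights from the timestep and the layer index."""

    def __init__(self, num_experts: int, num_layers: int, t_dim: int = 64):
        super().__init__()
        self.t_dim = t_dim
        self.layer_emb = nn.Embedding(num_layers, t_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * t_dim, t_dim), nn.SiLU(), nn.Linear(t_dim, num_experts)
        )

    def timestep_embedding(self, t: torch.Tensor) -> torch.Tensor:
        # standard sinusoidal embedding of the diffusion timestep
        half = self.t_dim // 2
        freqs = torch.exp(-torch.arange(half, device=t.device) * math.log(10000.0) / half)
        args = t.float()[:, None] * freqs[None, :]
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

    def forward(self, t: torch.Tensor, layer_idx: int) -> torch.Tensor:
        layer_vec = self.layer_emb(torch.full_like(t, layer_idx))
        gate_in = torch.cat([self.timestep_embedding(t), layer_vec], dim=-1)
        return torch.softmax(self.mlp(gate_in), dim=-1)  # (batch, num_experts)


class MultiExpertLayer(nn.Module):
    """Backbone hidden states plus a gated sum of per-preference expert residuals."""

    def __init__(self, dim: int, num_experts: int, num_layers: int, layer_idx: int):
        super().__init__()
        self.experts = nn.ModuleList(PreferenceExpert(dim) for _ in range(num_experts))
        self.gate = TimestepLayerGate(num_experts, num_layers)
        self.layer_idx = layer_idx

    def forward(self, h: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        w = self.gate(t, self.layer_idx)  # (batch, num_experts)
        residual = sum(
            w[:, i, None, None] * expert(h) for i, expert in enumerate(self.experts)
        )
        return h + residual


if __name__ == "__main__":
    # toy usage: 3 experts (lip-sync, motion, visual quality), hidden states (B, N, D)
    layer = MultiExpertLayer(dim=128, num_experts=3, num_layers=12, layer_idx=4)
    h = torch.randn(2, 16, 128)
    t = torch.randint(0, 1000, (2,))
    print(layer(h, t).shape)  # torch.Size([2, 16, 128])
```

Because the gate depends on both the timestep and the layer index, each preference expert can contribute more strongly where it matters (e.g., at timesteps or layers most relevant to its dimension) rather than competing uniformly everywhere, which matches the abstract's claim of fine-grained enhancement without mutual interference.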