FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation

Abstract

Recent advances in audio-driven portrait animation have demonstrated impressive capabilities. However, existing methods struggle to align with fine-grained human preferences across multiple dimensions, such as motion naturalness, lip-sync accuracy, and visual quality. Two factors explain this: preference objectives often conflict with one another, which makes them difficult to optimize jointly, and large-scale, high-quality datasets with multidimensional preference annotations are scarce. To address these challenges, we first introduce Talking-Critic, a multimodal reward model that learns human-aligned reward functions to quantify how well generated videos satisfy multidimensional expectations. Leveraging this model, we curate Talking-NSQ, a large-scale multidimensional human preference dataset containing 410K preference pairs. Finally, we propose Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), a novel framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences. TLPO decouples preferences into specialized expert modules, which are then fused across timesteps and network layers, enabling comprehensive, fine-grained enhancement across all dimensions without mutual interference. Experiments demonstrate that Talking-Critic significantly outperforms existing methods in aligning with human preference ratings. Meanwhile, TLPO achieves substantial improvements over baseline models in lip-sync accuracy, motion naturalness, and visual quality, exhibiting superior performance in both qualitative and quantitative evaluations. Project page: https://fantasy-amap.github.io/fantasy-talking2/

