
Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation

April 25, 2025
Authors: Weipeng Tan, Chuming Lin, Chengming Xu, FeiFan Xu, Xiaobin Hu, Xiaozhong Ji, Junwei Zhu, Chengjie Wang, Yanwei Fu
cs.AI

Abstract

Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio's inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework dubbed DICE-Talk, following the idea of disentangling identity from emotion, and then cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on the MEAD and HDTF datasets demonstrate our method's superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method's ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.
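
To make the second contribution more concrete, below is a minimal PyTorch sketch of the kind of mechanism the abstract describes: a learnable Emotion Bank that vector-quantizes an identity-agnostic emotion embedding against emotion prototypes and then aggregates correlated prototypes with attention. The class name, dimensions, and internals are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionBank(nn.Module):
    """Hypothetical sketch of a correlation-enhanced emotion conditioning module."""

    def __init__(self, num_emotions: int = 8, dim: int = 256):
        super().__init__()
        # One learnable prototype vector per emotion category.
        self.prototypes = nn.Parameter(torch.randn(num_emotions, dim))
        self.query_proj = nn.Linear(dim, dim)
        self.scale = dim ** 0.5

    def forward(self, emotion_embedding: torch.Tensor) -> torch.Tensor:
        # emotion_embedding: (batch, dim), e.g. a sample from the identity-agnostic
        # Gaussian emotion distribution produced by the emotion embedder.

        # 1) Vector quantization: snap each embedding to its nearest prototype.
        dists = torch.cdist(emotion_embedding, self.prototypes)   # (batch, num_emotions)
        nearest = dists.argmin(dim=-1)                            # (batch,)
        quantized = self.prototypes[nearest]                      # (batch, dim)
        # Straight-through estimator so gradients still reach the upstream embedder.
        quantized = emotion_embedding + (quantized - emotion_embedding).detach()

        # 2) Attention-based aggregation: blend in correlated prototypes
        #    (related emotions share facial-motion features), weighted by similarity.
        query = self.query_proj(quantized)                        # (batch, dim)
        attn = F.softmax(query @ self.prototypes.t() / self.scale, dim=-1)
        correlated = attn @ self.prototypes                       # (batch, dim)

        # Conditioning signal that would be fed to the diffusion backbone.
        return quantized + correlated


if __name__ == "__main__":
    bank = EmotionBank()
    cond = bank(torch.randn(4, 256))
    print(cond.shape)  # torch.Size([4, 256])
```

The straight-through quantization and the softmax over all prototypes are one plausible way to realize "vector quantization and attention-based feature aggregation" as stated in the abstract; the paper itself may differ in how correlations are parameterized.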
