Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation
April 25, 2025
Authors: Weipeng Tan, Chuming Lin, Chengming Xu, FeiFan Xu, Xiaobin Hu, Xiaozhong Ji, Junwei Zhu, Chengjie Wang, Yanwei Fu
cs.AI
Abstract
Recent advances in Talking Head Generation (THG) have achieved impressive lip
synchronization and visual quality through diffusion models; yet existing
methods struggle to generate emotionally expressive portraits while preserving
speaker identity. We identify three critical limitations in current emotional
talking head generation: insufficient utilization of audio's inherent emotional
cues, identity leakage in emotion representations, and isolated learning of
emotion correlations. To address these challenges, we propose a novel framework
dubbed DICE-Talk, following the idea of first disentangling identity from emotion
and then cooperating emotions with similar characteristics. First, we develop a
disentangled emotion embedder that jointly models audio-visual emotional cues
through cross-modal attention, representing emotions as identity-agnostic
Gaussian distributions. Second, we introduce a correlation-enhanced emotion
conditioning module with learnable Emotion Banks that explicitly capture
inter-emotion relationships through vector quantization and attention-based
feature aggregation. Third, we design an emotion discrimination objective that
enforces affective consistency during the diffusion process through
latent-space classification. Extensive experiments on MEAD and HDTF datasets
demonstrate our method's superiority, outperforming state-of-the-art approaches
in emotion accuracy while maintaining competitive lip-sync performance.
Qualitative results and user studies further confirm our method's ability to
generate identity-preserving portraits with rich, correlated emotional
expressions that naturally adapt to unseen identities.
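To make the conditioning ideas in the abstract more concrete, below is a minimal PyTorch sketch of an Emotion Bank queried through vector quantization and attention-based aggregation, paired with a latent-space emotion classification loss. The module name `EmotionBankConditioner`, the dimensions, and the wiring are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch, assuming an identity-agnostic emotion embedding is already
# available (e.g., sampled from the Gaussian produced by the emotion embedder).
# All names and hyperparameters here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionBankConditioner(nn.Module):
    """Maps an emotion embedding to a correlation-aware conditioning vector."""

    def __init__(self, dim: int = 256, num_emotions: int = 8):
        super().__init__()
        # One learnable prototype per emotion category (the "Emotion Bank").
        self.bank = nn.Parameter(torch.randn(num_emotions, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Auxiliary classifier standing in for the emotion discrimination objective.
        self.classifier = nn.Linear(dim, num_emotions)

    def forward(self, emo_feat: torch.Tensor) -> torch.Tensor:
        # emo_feat: (B, dim) identity-agnostic emotion embedding.
        # 1) Vector quantization: snap each embedding to its nearest prototype,
        #    with a straight-through estimator so the encoder still gets gradients.
        dists = torch.cdist(emo_feat, self.bank)             # (B, num_emotions)
        nearest = self.bank[dists.argmin(dim=-1)]            # (B, dim)
        quantized = emo_feat + (nearest - emo_feat).detach()

        # 2) Attention over the whole bank, so the conditioning vector can
        #    borrow from correlated emotion prototypes rather than a single code.
        q = quantized.unsqueeze(1)                                      # (B, 1, dim)
        kv = self.bank.unsqueeze(0).expand(emo_feat.size(0), -1, -1)   # (B, E, dim)
        cond, _ = self.attn(q, kv, kv)
        return cond.squeeze(1)                                # (B, dim)

    def discrimination_loss(self, cond: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cross-entropy on the conditioning signal, a stand-in for enforcing
        # affective consistency in latent space during diffusion training.
        return F.cross_entropy(self.classifier(cond), labels)


if __name__ == "__main__":
    model = EmotionBankConditioner()
    emo = torch.randn(4, 256)              # stand-in audio-visual emotion features
    cond = model(emo)                      # conditioning for the diffusion backbone
    loss = model.discrimination_loss(cond, torch.randint(0, 8, (4,)))
    print(cond.shape, loss.item())
```

In this reading, the quantization step keeps each sample anchored to a discrete emotion prototype, while the attention step lets related prototypes contribute, which is one plausible way to realize the "correlated emotional expressions" the abstract describes.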