Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
April 9, 2026
Authors: Chanhyuk Choi, Taesoo Kim, Donggyu Lee, Siyeol Jung, Taehwan Kim
cs.AI
Abstract
Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between the speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoints, and a demo are available at https://chanhyeok-choi.github.io/C-MET/.
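The core idea of the emotion semantic vector - a difference between two emotional embeddings that can be carried across modalities - can be illustrated with a minimal sketch. The encoders, the shared embedding dimension, and the simple additive transfer below are assumptions for illustration only; the paper's actual encoders and transfer mechanism may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # assumed shared embedding dimension

# Hypothetical encoder outputs: in C-MET these would come from a
# large-scale pretrained audio encoder and a disentangled facial
# expression encoder, here replaced by random vectors.
audio_neutral = rng.normal(size=D)  # embedding of neutral speech
audio_happy = rng.normal(size=D)    # embedding of happy speech
face_neutral = rng.normal(size=D)   # embedding of a neutral face

# Emotion semantic vector: the difference between two emotional
# embeddings (here, happy minus neutral) in the audio space.
emotion_vec = audio_happy - audio_neutral

# Cross-modal transfer (sketch): shift the neutral face embedding
# along the emotion semantic vector to obtain an emotional face code,
# which a decoder would then render as an expressive talking face.
face_happy = face_neutral + emotion_vec

print(face_happy.shape)  # (256,)
```

Under this additive picture, the shift applied to the face embedding equals the shift observed in the audio embeddings, which is what lets speech-side emotion drive visual expression.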