Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
April 9, 2026
Authors: Chanhyuk Choi, Taesoo Kim, Donggyu Lee, Siyeol Jung, Taehwan Kim
cs.AI
Abstract
Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing plays a crucial role in talking face video. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture the wide spectrum of human emotion. Audio-based methods can leverage emotionally rich speech signals, and even benefit from expressive text-to-speech (TTS) synthesis, but they often fail to express the target emotion because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between the speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos, even for unseen extended emotions. Code, checkpoints, and a demo are available at https://chanhyeok-choi.github.io/C-MET/.
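The core idea, emotion semantic vectors defined as differences between emotion embeddings, can be illustrated with a minimal sketch. The function names (`emotion_semantic_vector`, `transfer_emotion`) and the plain vector-arithmetic formulation are hypothetical simplifications for illustration, not the paper's actual C-MET implementation, which learns these vectors via pretrained audio and facial-expression encoders:

```python
import numpy as np

def emotion_semantic_vector(emotional_emb: np.ndarray,
                            neutral_emb: np.ndarray) -> np.ndarray:
    """Hypothetical sketch: model the emotion semantic vector as the
    difference between an emotional embedding and a neutral one
    (e.g., both produced by an audio encoder)."""
    return emotional_emb - neutral_emb

def transfer_emotion(face_emb: np.ndarray,
                     semantic_vec: np.ndarray,
                     scale: float = 1.0) -> np.ndarray:
    """Shift a facial-expression embedding along the cross-modal
    emotion direction; `scale` controls emotion intensity."""
    return face_emb + scale * semantic_vec
```

Under this toy formulation, an emotion direction extracted from speech embeddings can be added to a visual embedding, which is one way to read the paper's claim that the learned vectors bridge the speech and visual feature spaces.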