Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

Abstract

Talking face generation has gained significant attention as a core application of generative models. Emotion editing in talking face video plays a crucial role in enhancing the expressiveness and realism of synthesized videos. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions as discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals, and can even benefit from expressive text-to-speech (TTS) synthesis, but they fail to express the target emotion because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between the speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors, each of which represents the difference between two emotional embeddings, across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods while generating expressive talking face videos, even for unseen extended emotions. Code, checkpoints, and a demo are available at https://chanhyeok-choi.github.io/C-MET/.
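The notion of an emotion semantic vector as the difference between two emotional embeddings can be illustrated with a minimal sketch. This is not the C-MET implementation: the embeddings are toy lists rather than encoder outputs, every function name here is hypothetical, and a fixed linear map (`weights`) is assumed as a stand-in for whatever learned mapping carries a speech-space difference into the visual expression space.

```python
from typing import List, Sequence

Vector = List[float]


def mean_embedding(embeddings: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty collection of equal-length embedding vectors."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]


def emotion_semantic_vector(
    source: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
) -> Vector:
    """Difference between the mean target-emotion and mean source-emotion embeddings."""
    src = mean_embedding(source)
    tgt = mean_embedding(target)
    return [t - s for s, t in zip(src, tgt)]


def map_to_visual(audio_shift: Sequence[float],
                  weights: Sequence[Sequence[float]]) -> Vector:
    """Apply an assumed linear map (rows: visual dims, columns: audio dims)."""
    if any(len(row) != len(audio_shift) for row in weights):
        raise ValueError("each weight row must match the audio dimension")
    return [sum(w * a for w, a in zip(row, audio_shift)) for row in weights]


def apply_shift(expression: Sequence[float],
                shift: Sequence[float],
                scale: float = 1.0) -> Vector:
    """Move a visual expression embedding along the emotion direction."""
    if len(expression) != len(shift):
        raise ValueError("expression and shift must have the same dimension")
    return [x + scale * d for x, d in zip(expression, shift)]


# Toy 2-D "speech" embeddings for two emotions (illustrative values only).
neutral_audio = [[0.0, 0.0], [0.0, 2.0]]
happy_audio = [[1.0, 1.0], [3.0, 3.0]]

# Hypothetical fixed map from the 2-D speech space to a 3-D expression space.
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

audio_shift = emotion_semantic_vector(neutral_audio, happy_audio)
visual_shift = map_to_visual(audio_shift, weights)
edited_expression = apply_shift([0.5, 0.5, 0.5], visual_shift)
```

In this toy setup the speech-space direction from neutral to happy is computed once and then reused to shift any visual expression embedding. The `scale` argument is an added convenience for varying the strength of the shift and is not something the abstract describes.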
