DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation

Abstract

The generation of emotional talking faces from a single portrait image remains a significant challenge. Achieving expressive emotional talking and accurate lip-sync at the same time is particularly difficult, as expressiveness is often compromised for the accuracy of lip-sync. The LSTM network, widely adopted by many prior works, often fails to capture the subtleties and variations of emotional expressions. To address these challenges, we introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework tailored for generating diverse expressions and accurate lip-sync concurrently. In the first stage, we propose EmoDiff, a novel diffusion module that generates diverse, highly dynamic emotional expressions and head poses in accordance with the audio and the referenced emotion style. Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and emotion style. To this end, we deploy a video-to-video rendering module to transfer the expressions and lip motions from our proxy 3D avatar to an arbitrary portrait. Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy and perceptual quality.
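The abstract does not say how the first-stage diffusion module draws its samples. As a rough illustration only, the sketch below runs generic DDPM-style ancestral sampling over a per-frame motion vector, conditioned on audio features and a style vector. Everything in it is an assumption made for illustration: the names `make_schedule`, `toy_denoiser` and `sample_motion`, the linear noise schedule, the feature sizes, and the toy noise predictor that stands in for a trained network. It is not the paper's EmoDiff implementation.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule: per-step betas and cumulative products of (1 - beta)."""
    if num_steps == 1:
        betas = [beta_start]
    else:
        step = (beta_end - beta_start) / (num_steps - 1)
        betas = [beta_start + i * step for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_denoiser(x_t, t, audio_feat, style):
    """Hypothetical stand-in for a trained noise predictor (not the paper's network)."""
    style_bias = sum(style) / len(style)
    out = []
    for frame, audio in zip(x_t, audio_feat):
        audio_bias = sum(audio) / len(audio)
        out.append([0.1 * v + 0.01 * (audio_bias + style_bias) for v in frame])
    return out


def sample_motion(denoiser, audio_feat, style, motion_dim, num_steps=50, seed=0):
    """Ancestral sampling: start from Gaussian noise, denoise one step at a time."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    # One motion vector per audio frame, initialised as pure noise.
    x = [[rng.gauss(0.0, 1.0) for _ in range(motion_dim)] for _ in audio_feat]
    for t in reversed(range(num_steps)):
        eps = denoiser(x, t, audio_feat, style)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        # No noise is added on the final step (t == 0).
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [
            [scale * (v - coef * e) + sigma * rng.gauss(0.0, 1.0)
             for v, e in zip(frame, eps_frame)]
            for frame, eps_frame in zip(x, eps)
        ]
    return x


# Four audio frames with three features each, and a two-value style vector.
audio_feat = [[0.0, 0.5, 1.0] for _ in range(4)]
style = [0.2, -0.1]
motion = sample_motion(toy_denoiser, audio_feat, style, motion_dim=6)
print(len(motion), len(motion[0]))  # prints "4 6"
```

The conditioning enters only through the noise predictor, so swapping in a different style vector changes the sampled motion without touching the sampling loop. That separation is one plausible reading of how a reference emotion style could steer generation, not a claim about the authors' design.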
