EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

Abstract

In this work, we tackle the challenge of enhancing the realism and expressiveness of talking head video generation by focusing on the dynamic and nuanced relationship between audio cues and facial movements. We identify the limitations of traditional techniques, which often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles. To address these issues, we propose EMO, a novel framework that uses a direct audio-to-video synthesis approach, bypassing the need for intermediate 3D models or facial landmarks. Our method ensures seamless frame transitions and consistent identity preservation throughout the video, resulting in highly expressive and lifelike animations. Experimental results demonstrate that EMO can produce not only convincing speaking videos but also singing videos in various styles, significantly outperforming existing state-of-the-art methodologies in terms of expressiveness and realism.
