From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

Abstract

We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key to our method is combining the sample diversity of vector quantization (VQ) with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g., sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show that our model generates appropriate and diverse gestures, outperforming both diffusion-only and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset are available online.
