Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics

Abstract

Recent work on speech-driven 3D talking head generation has made significant progress in lip synchronization. However, existing models still struggle to capture the perceptual alignment between varying speech characteristics and the corresponding lip movements. In this work, we claim that three criteria (Temporal Synchronization, Lip Readability, and Expressiveness) are crucial for achieving perceptually accurate lip movements. Motivated by our hypothesis that a desirable representation space exists to meet these three criteria, we introduce a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes. We find that the learned representation exhibits desirable characteristics, and we plug it into existing models as a perceptual loss to better align lip movements with the given speech. In addition, we use this representation as a perceptual metric and introduce two other physically grounded lip synchronization metrics to assess how well generated 3D talking heads align with these three criteria. Experiments show that training 3D talking head generation models with our perceptual loss significantly improves all three aspects of perceptually accurate lip synchronization. Code and datasets are available at https://perceptual-3d-talking-head.github.io/.
