
RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

June 26, 2024
Authors: Xiaozhong Ji, Chuming Lin, Zhonggan Ding, Ying Tai, Jian Yang, Junwei Zhu, Xiaobin Hu, Jiangning Zhang, Donghao Luo, Chengjie Wang
cs.AI

Abstract

Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but there is still a significant gap between current results and practical applications. The challenges are two-fold: 1) preserving unique individual traits to achieve high-precision lip synchronization, and 2) generating high-quality facial renderings in real time. In this paper, we propose a novel generalized audio-driven framework, RealTalk, which consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. By incorporating cross-modal attention on the enriched facial priors, we can effectively align lip movements with audio, attaining greater precision in expression prediction. In the second component, we design a lightweight facial identity alignment (FIA) module that includes a lip-shape control structure and a face-texture reference structure. This novel design allows us to generate fine details in real time, without depending on sophisticated and inefficient feature-alignment modules. Our experimental results, both quantitative and qualitative, on public datasets demonstrate the clear advantages of our method in terms of lip-speech synchronization and generation quality. Furthermore, our method is efficient and requires fewer computational resources, making it well suited to the needs of practical applications.
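The first component's core idea, audio features cross-attending to facial priors to predict expressions, can be sketched concretely. Below is a minimal PyTorch sketch in which per-frame audio embeddings act as queries over facial-prior tokens (identity plus intra-personal variation), and the fused features are decoded into expression coefficients. All class names, dimensions, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Hypothetical sketch of an audio-to-expression transformer stage:
    audio features cross-attend to facial-prior tokens and the result is
    decoded into per-frame expression coefficients. Dimensions are made up."""

    def __init__(self, audio_dim=256, prior_dim=256, n_exp=64, n_heads=4):
        super().__init__()
        # Project facial priors into the audio feature space for attention.
        self.prior_proj = nn.Linear(prior_dim, audio_dim)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=audio_dim, num_heads=n_heads, batch_first=True)
        # Decode fused features into 3DMM-style expression coefficients.
        self.head = nn.Linear(audio_dim, n_exp)

    def forward(self, audio_feats, prior_feats):
        # audio_feats: (B, T, audio_dim) per-frame audio embeddings
        # prior_feats: (B, N, prior_dim) identity / variation prior tokens
        kv = self.prior_proj(prior_feats)
        fused, _ = self.cross_attn(query=audio_feats, key=kv, value=kv)
        return self.head(fused)  # (B, T, n_exp)

model = AudioToExpression()
audio = torch.randn(2, 25, 256)   # e.g., 1 s of audio features at 25 fps
priors = torch.randn(2, 10, 256)  # identity + intra-personal variation tokens
print(model(audio, priors).shape)  # torch.Size([2, 25, 64])
```

The second component, the FIA module, would then render the face from these predicted expressions, combining lip-shape control with face-texture references; a renderer of that kind is beyond the scope of this sketch.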
