RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

June 26, 2024
Authors: Xiaozhong Ji, Chuming Lin, Zhonggan Ding, Ying Tai, Jian Yang, Junwei Zhu, Xiaobin Hu, Jiangning Zhang, Donghao Luo, Chengjie Wang
cs.AI

Abstract

Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but there is still a significant gap between current results and practical applications. The challenges are two-fold: 1) preserving unique individual traits to achieve high-precision lip synchronization, and 2) generating high-quality facial renderings in real time. In this paper, we propose RealTalk, a novel generalized audio-driven framework consisting of an audio-to-expression transformer and a high-fidelity expression-to-face renderer. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. By incorporating cross-modal attention on the enriched facial priors, we can effectively align lip movements with audio, thus attaining greater precision in expression prediction. In the second component, we design a lightweight facial identity alignment (FIA) module, which includes a lip-shape control structure and a face texture reference structure. This novel design allows us to generate fine details in real time, without depending on sophisticated and inefficient feature alignment modules. Our experimental results, both quantitative and qualitative, on public datasets demonstrate the clear advantages of our method in terms of lip-speech synchronization and generation quality. Furthermore, our method is efficient and requires fewer computational resources than prior approaches, making it well-suited to practical applications.
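To make the first component concrete, below is a minimal PyTorch sketch of the cross-modal attention idea described in the abstract: per-frame audio features act as queries over enriched facial-prior tokens (identity plus intra-personal variation) to predict expression coefficients. All module names, dimensions, and the fusion scheme here are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch (assumed shapes/names): audio queries attend to facial priors
# to regress per-frame expression coefficients, as in RealTalk's first stage.
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    def __init__(self, audio_dim=256, prior_dim=256, expr_dim=64, n_heads=4):
        super().__init__()
        # Project facial-prior tokens into the audio feature space.
        self.prior_proj = nn.Linear(prior_dim, audio_dim)
        # Cross-modal attention: queries from audio, keys/values from priors.
        self.attn = nn.MultiheadAttention(audio_dim, n_heads, batch_first=True)
        # Regress expression coefficients from the fused features.
        self.expr_head = nn.Linear(audio_dim, expr_dim)

    def forward(self, audio_feats, facial_priors):
        # audio_feats:   (B, T, audio_dim)  per-frame audio embeddings
        # facial_priors: (B, P, prior_dim)  identity + intra-personal tokens
        kv = self.prior_proj(facial_priors)
        fused, _ = self.attn(query=audio_feats, key=kv, value=kv)
        return self.expr_head(fused)  # (B, T, expr_dim) expression coefficients

model = AudioToExpression()
expr = model(torch.randn(2, 50, 256), torch.randn(2, 10, 256))
print(expr.shape)  # torch.Size([2, 50, 64])
```

In the full system, the predicted expression coefficients would then drive the expression-to-face renderer, where the FIA module injects lip-shape control and reference face texture without a heavyweight feature-alignment stage.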
