
Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait

March 17, 2025
作者: Chaolong Yang, Kai Yao, Yuyao Yan, Chenru Jiang, Weiguang Zhao, Jie Sun, Guangliang Cheng, Yifei Zhang, Bin Dong, Kaizhu Huang
cs.AI

Abstract

Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed set of points in the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using diffusion networks but suffer from identity distortion and high computational cost. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoints with a spatiotemporal diffusion model. By leveraging unsupervised implicit 3D keypoints, KDTalker adapts to facial information density, allowing the diffusion process to flexibly model diverse head poses and capture fine facial details. A custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while improving computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance in lip synchronization accuracy, head pose diversity, and execution efficiency. Our code is available at https://github.com/chaolongy/KDTalker.
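
To make the abstract's architecture concrete, below is a minimal PyTorch sketch of a spatiotemporal attention block that denoises a sequence of implicit 3D keypoints conditioned on per-frame audio features. All module names, tensor shapes, the keypoint count, and the cross-attention conditioning scheme are illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual KDTalker code.

```python
# Illustrative sketch only: spatiotemporal attention over an implicit-keypoint
# sequence with audio conditioning, as one plausible building block of a
# keypoint-space diffusion model. Shapes and names are assumptions.
import torch
import torch.nn as nn

class SpatiotemporalKeypointAttention(nn.Module):
    def __init__(self, num_kp=21, kp_dim=3, d_model=128, n_heads=4):
        super().__init__()
        # Embed each frame's flattened keypoints into the model dimension.
        self.embed = nn.Linear(num_kp * kp_dim, d_model)
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.out = nn.Linear(d_model, num_kp * kp_dim)

    def forward(self, kp_seq, audio_feats):
        # kp_seq:      (B, T, num_kp * kp_dim) noisy keypoint sequence
        # audio_feats: (B, T, d_model) per-frame audio embeddings
        x = self.embed(kp_seq)
        # Temporal self-attention links frames for temporally consistent motion.
        h, _ = self.temporal_attn(x, x, x)
        x = self.norm1(x + h)
        # Cross-attention injects audio, driving lip-synchronized keypoint updates.
        h, _ = self.audio_cross_attn(x, audio_feats, audio_feats)
        x = self.norm2(x + h)
        return self.out(x)  # denoising prediction in keypoint space

# Usage: one denoising step on a batch of 64-frame keypoint sequences.
block = SpatiotemporalKeypointAttention()
kp_noisy = torch.randn(2, 64, 21 * 3)
audio = torch.randn(2, 64, 128)
pred = block(kp_noisy, audio)  # (2, 64, 63)
```

Operating in the low-dimensional keypoint space, rather than on pixels, is what the abstract credits for the efficiency gain: the diffusion model only has to denoise a few dozen values per frame instead of a full image.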
