Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
March 17, 2025
Authors: Chaolong Yang, Kai Yao, Yuyao Yan, Chenru Jiang, Weiguang Zhao, Jie Sun, Guangliang Cheng, Yifei Zhang, Bin Dong, Kaizhu Huang
cs.AI
Abstract
Audio-driven single-image talking portrait generation plays a crucial role in
virtual reality, digital human creation, and filmmaking. Existing approaches
are generally categorized into keypoint-based and image-based methods.
Keypoint-based methods effectively preserve character identity but struggle to
capture fine facial details because the 3D Morphable Model is restricted to a
fixed set of points. Moreover, traditional generative networks have difficulty
establishing causality between audio and keypoints on limited datasets,
resulting in low pose diversity. In contrast, image-based approaches use
diffusion networks to produce high-quality portraits with rich details, but
they suffer from identity distortion and high computational cost. In this work, we
propose KDTalker, the first framework to combine unsupervised implicit 3D
keypoints with a spatiotemporal diffusion model. Leveraging unsupervised
implicit 3D keypoints, KDTalker adapts to varying facial information densities,
allowing the diffusion process to flexibly model diverse head poses and capture
fine facial details. A custom-designed spatiotemporal attention mechanism
ensures accurate lip synchronization, producing temporally consistent,
high-quality animations while enhancing computational efficiency. Experimental
results demonstrate that KDTalker achieves state-of-the-art performance
in lip synchronization accuracy, head pose diversity, and execution
efficiency. Our code is available at https://github.com/chaolongy/KDTalker.
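The abstract names two mechanisms: a diffusion model that denoises sequences of implicit 3D keypoints conditioned on audio, and a spatiotemporal attention mechanism. Below is a minimal, hypothetical PyTorch sketch of how such a denoiser could be wired, not the authors' implementation: all names (SpatioTemporalBlock, KeypointDenoiser) and sizes (kp_dim=3, audio_dim=80, 21 keypoints) are illustrative assumptions, and the diffusion timestep embedding and the keypoint-driven face renderer are omitted; see the linked repository for the real architecture.

```python
# Hypothetical sketch (not the authors' code): a DDPM-style denoiser over a
# sequence of implicit 3D keypoints, conditioned on per-frame audio features.
# Spatial attention mixes keypoints within a frame; temporal attention mixes
# frames per keypoint. Timestep embedding and the face renderer are omitted.
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        b, t, k, d = x.shape                      # (batch, frames, keypoints, dim)
        s = x.reshape(b * t, k, d)                # spatial: attend across keypoints
        q = self.norm1(s)
        s = s + self.spatial(q, q, q)[0]
        tt = s.reshape(b, t, k, d).permute(0, 2, 1, 3).reshape(b * k, t, d)
        q = self.norm2(tt)                        # temporal: attend across frames
        tt = tt + self.temporal(q, q, q)[0]
        return tt.reshape(b, k, t, d).permute(0, 2, 1, 3)

class KeypointDenoiser(nn.Module):
    """Predicts the noise added to a keypoint-motion sequence (DDPM objective),
    conditioned on audio features broadcast to every keypoint of each frame."""
    def __init__(self, kp_dim=3, dim=128, heads=4, depth=2, audio_dim=80):
        super().__init__()
        self.in_proj = nn.Linear(kp_dim, dim)
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.blocks = nn.ModuleList(SpatioTemporalBlock(dim, heads) for _ in range(depth))
        self.out_proj = nn.Linear(dim, kp_dim)

    def forward(self, noisy_kp, audio):
        # noisy_kp: (batch, frames, keypoints, 3); audio: (batch, frames, audio_dim)
        h = self.in_proj(noisy_kp) + self.audio_proj(audio).unsqueeze(2)
        for blk in self.blocks:
            h = blk(h)
        return self.out_proj(h)                   # predicted noise, same shape as input

# Usage: a 2-second clip at 25 fps with 21 implicit keypoints (sizes assumed).
model = KeypointDenoiser()
eps_hat = model(torch.randn(1, 50, 21, 3), torch.randn(1, 50, 80))
print(eps_hat.shape)  # torch.Size([1, 50, 21, 3])
```

At inference, a standard diffusion sampling loop would iteratively denoise random noise into a keypoint-motion sequence, which a keypoint-based renderer then turns into video frames.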