ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
August 6, 2024
Authors: Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu
cs.AI
Abstract
Lip-syncing videos with given audio is the foundation for various applications including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-oriented models either require long-term videos for clip-specific training or retain visible artifacts. In this paper, we propose a unified and effective framework, ReSyncer, that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply re-configuring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio, but also supports multiple appealing properties that are suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping. Resources can be found at https://guanjz20.github.io/projects/ReSyncer.
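
The abstract describes re-routing audio-driven 3D facial dynamics into a Style-based generator through its style and noise inputs. The sketch below is a hypothetical illustration of that general idea only, not the paper's actual architecture or code: the module names, dimensions, and the specific mapping from motion to style offsets and noise maps are all assumptions.

# Hypothetical sketch (assumed design, not ReSyncer's released code):
# inject predicted 3D facial dynamics into a Style-based generator by
# adding motion-conditioned offsets to the per-layer style codes and by
# replacing the random noise inputs with motion-derived spatial maps.
import torch
import torch.nn as nn

class MotionToStyle(nn.Module):
    """Maps a motion vector to per-layer style offsets (assumed interface)."""
    def __init__(self, motion_dim=64, style_dim=512, num_layers=14):
        super().__init__()
        self.proj = nn.Linear(motion_dim, style_dim * num_layers)
        self.num_layers, self.style_dim = num_layers, style_dim

    def forward(self, motion):                      # motion: (B, motion_dim)
        offsets = self.proj(motion)                 # (B, style_dim * num_layers)
        return offsets.view(-1, self.num_layers, self.style_dim)

class MotionToNoise(nn.Module):
    """Renders the motion vector into a spatial map used in place of noise."""
    def __init__(self, motion_dim=64, base_res=4, channels=1):
        super().__init__()
        self.fc = nn.Linear(motion_dim, channels * base_res * base_res)
        self.base_res, self.channels = base_res, channels

    def forward(self, motion, target_res):
        x = self.fc(motion).view(-1, self.channels, self.base_res, self.base_res)
        return nn.functional.interpolate(
            x, size=target_res, mode="bilinear", align_corners=False)

# Usage: combine an identity (appearance) style code with motion-conditioned
# offsets, and feed motion-derived maps where random noise would normally go.
B, motion_dim, style_dim, num_layers = 2, 64, 512, 14
motion = torch.randn(B, motion_dim)                 # e.g. output of an audio-driven Transformer (assumed)
identity_w = torch.randn(B, num_layers, style_dim)  # appearance code from an encoder (assumed)

styles = identity_w + MotionToStyle(motion_dim, style_dim, num_layers)(motion)
noise_64 = MotionToNoise(motion_dim)(motion, target_res=64)
print(styles.shape, noise_64.shape)                 # (2, 14, 512) and (2, 1, 64, 64)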