
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

September 4, 2024
作者: Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, Yanbo Zheng
cs.AI

Abstract

With the introduction of diffusion-based video generation techniques, audio-conditioned human video generation has recently achieved significant breakthroughs in both the naturalness of motion and the synthesis of portrait details. Because audio signals provide only limited control over human motion, existing methods often add auxiliary spatial signals to stabilize movements, which may compromise the naturalness and freedom of motion. In this paper, we propose an end-to-end audio-only conditioned video diffusion model named Loopy. Specifically, we design an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information from the data to learn natural motion patterns and to improve audio-portrait movement correlation. This method removes the need for the manually specified spatial motion templates that existing methods use to constrain motion during inference. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and high-quality results across various scenarios.
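
The abstract names two components: an inter-/intra-clip temporal module that attends both within the current clip and over motion frames from preceding clips, and an audio-to-latents module that maps audio features into conditioning latents for the diffusion backbone. The following is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation; the module names, layer choices, dimensions, and use of `nn.MultiheadAttention` are all assumptions made for illustration.

```python
# Hypothetical sketch of the two components named in the abstract.
# Not the authors' code: module names, shapes, and attention layout are assumptions.
import torch
import torch.nn as nn


class AudioToLatents(nn.Module):
    """Maps per-frame audio features into conditioning latents (assumed MLP)."""

    def __init__(self, audio_dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) -> (batch, frames, latent_dim)
        return self.proj(audio_feats)


class InterIntraClipTemporal(nn.Module):
    """Attends within the current clip and across frames from preceding clips."""

    def __init__(self, latent_dim: int, num_heads: int = 8):
        super().__init__()
        self.intra_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.inter_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(latent_dim)
        self.norm2 = nn.LayerNorm(latent_dim)

    def forward(self, cur_clip: torch.Tensor, past_clips: torch.Tensor) -> torch.Tensor:
        # cur_clip:   (batch, cur_frames, latent_dim)  frames being denoised
        # past_clips: (batch, past_frames, latent_dim) long-term motion context
        q = self.norm1(cur_clip)
        x = cur_clip + self.intra_attn(q, q, q)[0]           # intra-clip attention
        x = x + self.inter_attn(self.norm2(x), past_clips, past_clips)[0]  # inter-clip
        return x


if __name__ == "__main__":
    b, cur_f, past_f, a_dim, l_dim = 2, 16, 64, 384, 512
    audio = torch.randn(b, cur_f, a_dim)
    past = torch.randn(b, past_f, l_dim)
    cond = AudioToLatents(a_dim, l_dim)(audio)           # audio conditioning latents
    fused = InterIntraClipTemporal(l_dim)(cond, past)    # -> (2, 16, 512)
    print(fused.shape)
```

The sketch only illustrates the data flow suggested by the abstract: audio features become latents, and temporal attention mixes the current clip with long-term motion context in place of the spatial motion templates that other methods require.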
