ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model
March 27, 2025
Authors: Jinwei Qi, Chaonan Ji, Sheng Xu, Peng Zhang, Bang Zhang, Liefeng Bo
cs.AI
Abstract
Real-time interactive video-chat portraits have been increasingly recognized
as the future trend, particularly due to the remarkable progress made in text
and voice chat technologies. However, existing methods primarily focus on
real-time generation of head movements, but struggle to produce synchronized
body motions that match these head actions. Additionally, achieving
fine-grained control over the speaking style and nuances of facial expressions
remains a challenge. To address these limitations, we introduce a novel
framework for stylized real-time portrait video generation, enabling expressive
and flexible video chat that extends from talking head to upper-body
interaction. Our approach consists of two stages. The first stage uses
efficient hierarchical motion diffusion models that take both explicit and
implicit motion representations into account, conditioned on audio inputs.
These models can generate a diverse range of facial expressions with stylistic
control and synchronization between head and body movements. The second stage
aims to generate portrait video featuring upper-body movements, including hand
gestures. We inject explicit hand control signals into the generator to produce
more detailed hand movements, and further perform face refinement to enhance
the overall realism and expressiveness of the portrait video. Additionally, our
approach supports efficient and continuous generation of upper-body portrait
video at resolutions up to 512 × 768 and at up to 30fps on a 4090 GPU,
supporting real-time interactive video chat. Experimental results demonstrate
that our approach produces portrait videos with rich expressiveness and
natural upper-body movements.
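To put the real-time figures in concrete terms, the short calculation below works out what sustained generation at 30fps and 512 × 768 implies. It is illustrative arithmetic on the numbers quoted in the abstract only. The function names are hypothetical, and nothing here reflects how the authors divide time between the two stages.

```python
# Illustrative arithmetic only: the resolution and frame rate are the maximum
# figures quoted in the abstract; the helper names are hypothetical.
DIM_A, DIM_B = 512, 768   # maximum resolution stated in the abstract
FPS = 30                  # maximum frame rate stated in the abstract


def frame_budget_ms(fps: float) -> float:
    """Average time available per frame to sustain the given frame rate."""
    return 1000.0 / fps


def pixels_per_second(dim_a: int, dim_b: int, fps: float) -> float:
    """Number of output pixels that must be produced each second."""
    return dim_a * dim_b * fps


budget = frame_budget_ms(FPS)
throughput = pixels_per_second(DIM_A, DIM_B, FPS)

print(f"average per-frame budget: {budget:.1f} ms")
print(f"pixels per frame: {DIM_A * DIM_B}")
print(f"pixel throughput: {throughput / 1e6:.1f} megapixels/s")
```

At these settings the generator has to average roughly 33.3 ms per frame, for about 11.8 megapixels of output per second. This is a throughput requirement: a pipelined system could still take longer than 33.3 ms end to end for any single frame.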