ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model
March 27, 2025
Authors: Jinwei Qi, Chaonan Ji, Sheng Xu, Peng Zhang, Bang Zhang, Liefeng Bo
cs.AI
Abstract
Real-time interactive video-chat portraits are increasingly recognized
as a future trend, particularly due to the remarkable progress made in text
and voice chat technologies. However, existing methods primarily focus on
real-time generation of head movements, but struggle to produce synchronized
body motions that match these head actions. Additionally, achieving
fine-grained control over the speaking style and nuances of facial expressions
remains a challenge. To address these limitations, we introduce a novel
framework for stylized real-time portrait video generation, enabling expressive
and flexible video chat that extends from talking head to upper-body
interaction. Our approach consists of two stages. The first stage
uses efficient hierarchical motion diffusion models that, conditioned on audio
input, take both explicit and implicit motion representations into account
and can generate a diverse range of facial expressions with stylistic
control and synchronization between head and body movements. The second stage
aims to generate portrait video featuring upper-body movements, including hand
gestures. We inject explicit hand control signals into the generator to produce
more detailed hand movements, and further perform face refinement to enhance
the overall realism and expressiveness of the portrait video. Additionally, our
approach supports efficient and continuous generation of upper-body portrait
video at up to 512 × 768 resolution and up to 30 fps on a 4090 GPU, enabling
real-time interactive video chat. Experimental results demonstrate the
capability of our approach to produce portrait videos with rich expressiveness
and natural upper-body movements.
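
As a rough illustration of what the real-time figures in the abstract imply, the sketch below works out the average per-frame time budget at 30 fps and the pixel throughput at 512 × 768. It is back-of-the-envelope arithmetic on the numbers quoted above, not a measurement and not part of the authors' code; the function names are hypothetical.

```python
# Frame-budget arithmetic for the figures quoted in the abstract.
# Function names are illustrative only and do not come from the paper.

WIDTH, HEIGHT = 512, 768  # maximum resolution stated in the abstract
FPS = 30                  # maximum frame rate stated in the abstract


def frame_budget_ms(fps: float) -> float:
    """Average wall-clock time available per frame, in milliseconds."""
    return 1000.0 / fps


def pixels_per_second(width: int, height: int, fps: float) -> float:
    """Number of output pixels the generator must produce each second."""
    return width * height * fps


budget = frame_budget_ms(FPS)
throughput = pixels_per_second(WIDTH, HEIGHT, FPS)

print(f"{budget:.1f} ms per frame")
print(f"{throughput / 1e6:.1f} megapixels per second")
```

To sustain 30 fps, the motion diffusion stage and the video generation stage together need to deliver, on average, one frame roughly every 33 ms.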