Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
April 3, 2025
Authors: Fa-Ting Hong, Zunnan Xu, Zixiang Zhou, Jun Zhou, Xiu Li, Qin Lin, Qinglin Lu, Dan Xu
cs.AI
Abstract
Talking head synthesis is vital for virtual avatars and human-computer interaction. However, most existing methods are limited to accepting control from a single primary modality, restricting their practical utility. To this end, we introduce ACTalker, an end-to-end video diffusion framework that supports both multi-signal and single-signal control for talking head video generation. For multi-signal control, we design a parallel Mamba structure with multiple branches, each utilizing a separate driving signal to control specific facial regions. A gate mechanism is applied across all branches, providing flexible control over video generation. To ensure that the controlled video is naturally coordinated both temporally and spatially, we employ the Mamba structure, which enables the driving signals to manipulate feature tokens across both the temporal and spatial dimensions in each branch. Additionally, we introduce a mask-drop strategy that allows each driving signal to independently control its corresponding facial region within the Mamba structure, preventing control conflicts. Experimental results demonstrate that our method produces natural-looking facial videos driven by diverse signals and that the Mamba layer seamlessly integrates multiple driving modalities without conflict.