Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
April 3, 2025
Authors: Fa-Ting Hong, Zunnan Xu, Zixiang Zhou, Jun Zhou, Xiu Li, Qin Lin, Qinglin Lu, Dan Xu
cs.AI
Abstract
Talking head synthesis is vital for virtual avatars and human-computer interaction. However, most existing methods are limited to accepting control from a single primary modality, restricting their practical utility. To this end, we introduce ACTalker, an end-to-end video diffusion framework that supports both multi-signal and single-signal control for talking head video generation. For multi-signal control, we design a parallel Mamba structure with multiple branches, each utilizing a separate driving signal to control specific facial regions. A gate mechanism is applied across all branches, providing flexible control over video generation. To ensure that the controlled video is naturally coordinated both temporally and spatially, we employ the Mamba structure, which enables the driving signals to manipulate feature tokens across both the temporal and spatial dimensions in each branch. Additionally, we introduce a mask-drop strategy that allows each driving signal to independently control its corresponding facial region within the Mamba structure, preventing control conflicts. Experimental results demonstrate that our method produces natural-looking facial videos driven by diverse signals and that the Mamba layer seamlessly integrates multiple driving modalities without conflict.