HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
September 10, 2025
Authors: Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu
cs.AI
Abstract
Human-Centric Video Generation (HCVG) methods seek to synthesize human videos
from multimodal inputs, including text, image, and audio. Existing methods
struggle to effectively coordinate these heterogeneous modalities due to two
challenges: the scarcity of training data with paired triplet conditions and
the difficulty of coordinating the sub-tasks of subject preservation and
audio-visual sync under multimodal inputs. In this work, we present HuMo, a
unified HCVG framework for collaborative multimodal control. For the first
challenge, we construct a high-quality dataset with diverse and paired text,
reference images, and audio. For the second challenge, we propose a two-stage
progressive multimodal training paradigm with task-specific strategies. For the
subject preservation task, to maintain the prompt-following and visual
generation abilities of the foundation model, we adopt a minimally invasive
image injection strategy. For the audio-visual sync task, besides the commonly
adopted audio cross-attention layer, we propose a focus-by-predicting strategy
that implicitly guides the model to associate audio with facial regions. For
joint learning of controllability across multimodal inputs, building on
previously acquired capabilities, we progressively incorporate the audio-visual
sync task. During inference, for flexible and fine-grained multimodal control,
we design a time-adaptive Classifier-Free Guidance strategy that dynamically
adjusts guidance weights across denoising steps. Extensive experimental results
demonstrate that HuMo surpasses specialized state-of-the-art methods on its
sub-tasks, establishing a unified framework for collaborative
multimodal-conditioned HCVG. Project Page:
https://phantom-video.github.io/HuMo.
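
As a rough illustration of the time-adaptive Classifier-Free Guidance idea described in the abstract, the following Python sketch linearly re-weights per-modality guidance across denoising steps and composes a guided noise prediction from three denoiser outputs (unconditional, text-only, text-plus-audio). The function names, weight values, and linear schedule are illustrative assumptions, not details taken from the paper.

import torch

def guidance_weights(step: int, num_steps: int,
                     w_text=(7.5, 5.0), w_audio=(2.0, 6.0)):
    """Hypothetical linear schedule: per-modality CFG weights shift over denoising steps."""
    t = step / max(num_steps - 1, 1)               # progress in [0, 1]
    w_t = w_text[0] + t * (w_text[1] - w_text[0])
    w_a = w_audio[0] + t * (w_audio[1] - w_audio[0])
    return w_t, w_a

def guided_noise(eps_uncond, eps_text, eps_text_audio, step, num_steps):
    """Compose a multi-condition CFG prediction from three denoiser outputs."""
    w_t, w_a = guidance_weights(step, num_steps)
    return (eps_uncond
            + w_t * (eps_text - eps_uncond)        # text guidance term
            + w_a * (eps_text_audio - eps_text))   # audio guidance stacked on text

# Toy usage with random tensors standing in for denoiser outputs.
if __name__ == "__main__":
    shape = (1, 4, 16, 32, 32)                     # (batch, channels, frames, height, width)
    eps_u, eps_t, eps_ta = (torch.randn(shape) for _ in range(3))
    for step in range(50):
        eps = guided_noise(eps_u, eps_t, eps_ta, step, num_steps=50)

Other schedules (for example, emphasizing audio guidance only in later, detail-refining steps) fit the same interface; the point is simply that guidance weights need not be constant across denoising.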