HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
September 10, 2025
Authors: Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu
cs.AI
Abstract
Human-Centric Video Generation (HCVG) methods seek to synthesize human videos
from multimodal inputs, including text, image, and audio. Existing methods
struggle to effectively coordinate these heterogeneous modalities due to two
challenges: the scarcity of training data with paired triplet conditions and
the difficulty of coordinating the sub-tasks of subject preservation and
audio-visual sync under multimodal inputs. In this work, we present HuMo, a
unified HCVG framework for collaborative multimodal control. For the first
challenge, we construct a high-quality dataset with diverse and paired text,
reference images, and audio. For the second challenge, we propose a two-stage
progressive multimodal training paradigm with task-specific strategies. For the
subject preservation task, to maintain the prompt-following and visual
generation abilities of the foundation model, we adopt a minimally invasive
image injection strategy. For the audio-visual sync task, besides the commonly
adopted audio cross-attention layer, we propose a focus-by-predicting strategy
that implicitly guides the model to associate audio with facial regions. For
joint learning of controllability across multimodal inputs, we progressively
incorporate the audio-visual sync task, building on the previously acquired
capabilities. During inference, for flexible and fine-grained multimodal control,
we design a time-adaptive Classifier-Free Guidance strategy that dynamically
adjusts guidance weights across denoising steps. Extensive experimental results
demonstrate that HuMo surpasses specialized state-of-the-art methods in
sub-tasks, establishing a unified framework for collaborative
multimodal-conditioned HCVG. Project Page:
https://phantom-video.github.io/HuMo.
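
To make the audio conditioning pathway mentioned in the abstract more concrete, below is a minimal sketch of an audio cross-attention layer of the kind commonly used in audio-driven video diffusion models. The module name, feature dimensions, and residual injection are illustrative assumptions, not HuMo's exact implementation.

```python
# Minimal sketch of an audio cross-attention layer (assumed design, not HuMo's code).
import torch
import torch.nn as nn


class AudioCrossAttention(nn.Module):
    """Video latent tokens (queries) attend to per-frame audio features
    (keys/values), injecting lip-sync cues into the denoising backbone."""

    def __init__(self, dim: int, audio_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_kv = nn.Linear(audio_dim, dim)  # project audio features to model width
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N_video, dim); audio_feats: (B, N_audio, audio_dim)
        kv = self.to_kv(audio_feats)
        out, _ = self.attn(self.norm(video_tokens), kv, kv)
        return video_tokens + out  # residual injection


# Example: 256 video tokens of width 1024 attending to 50 audio frames of width 768.
layer = AudioCrossAttention(dim=1024, audio_dim=768)
x = layer(torch.randn(1, 256, 1024), torch.randn(1, 50, 768))
print(x.shape)  # torch.Size([1, 256, 1024])
```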
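
The time-adaptive Classifier-Free Guidance described in the abstract adjusts per-modality guidance weights over the denoising trajectory. The sketch below shows one common multi-condition CFG formulation with linearly interpolated weights; the model call signature, the specific schedules, weight values, and condition-dropping order are assumptions for illustration only.

```python
# Minimal sketch (not the paper's exact formulation) of time-adaptive CFG
# over text, reference-image, and audio conditions.
import torch


def time_adaptive_weight(step: int, num_steps: int, w_early: float, w_late: float) -> float:
    """Linearly interpolate a guidance weight from the early (noisy) to the
    late (detail-refining) denoising steps."""
    alpha = step / max(num_steps - 1, 1)
    return (1.0 - alpha) * w_early + alpha * w_late


@torch.no_grad()
def guided_noise_prediction(model, x_t, t, step, num_steps, text, ref_img, audio):
    """Combine unconditional and progressively conditioned predictions with
    per-modality, per-step guidance weights (a common multi-condition CFG pattern).
    The model's keyword interface here is a hypothetical placeholder."""
    eps_uncond = model(x_t, t, text=None, ref_img=None, audio=None)
    eps_text = model(x_t, t, text=text, ref_img=None, audio=None)
    eps_img = model(x_t, t, text=text, ref_img=ref_img, audio=None)
    eps_full = model(x_t, t, text=text, ref_img=ref_img, audio=audio)

    # Assumed schedules: stronger text/subject guidance early (layout, identity),
    # stronger audio guidance late (lip detail).
    w_text = time_adaptive_weight(step, num_steps, w_early=7.5, w_late=5.0)
    w_img = time_adaptive_weight(step, num_steps, w_early=2.0, w_late=1.0)
    w_audio = time_adaptive_weight(step, num_steps, w_early=1.0, w_late=3.0)

    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_img * (eps_img - eps_text)
            + w_audio * (eps_full - eps_img))
```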