HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
May 7, 2025
Authors: Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, Qinglin Lu
cs.AI
Abstract
Customized video generation aims to produce videos featuring specific
subjects under flexible user-defined conditions, yet existing methods often
struggle with identity consistency and limited input modalities. In this paper,
we propose HunyuanCustom, a multi-modal customized video generation framework
that emphasizes subject consistency while supporting image, audio, video, and
text conditions. Built upon HunyuanVideo, our model first addresses the
image-text conditioned generation task by introducing a text-image fusion
module based on LLaVA for enhanced multi-modal understanding, along with an
image ID enhancement module that leverages temporal concatenation to reinforce
identity features across frames. To enable audio- and video-conditioned
generation, we further propose modality-specific condition injection
mechanisms: an AudioNet module that achieves hierarchical alignment via spatial
cross-attention, and a video-driven injection module that integrates
latent-compressed conditional video through a patchify-based feature-alignment
network. Extensive experiments on single- and multi-subject scenarios
demonstrate that HunyuanCustom significantly outperforms state-of-the-art open-
and closed-source methods in terms of ID consistency, realism, and text-video
alignment. Moreover, we validate its robustness across downstream tasks,
including audio- and video-driven customized video generation. Our results
highlight the effectiveness of multi-modal conditioning and identity-preserving
strategies in advancing controllable video generation. All the code and models
are available at https://hunyuancustom.github.io.
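
To make the identity-preservation idea concrete, below is a minimal PyTorch sketch of conditioning via temporal concatenation as the abstract describes it. The module name `IdentityTemporalConcat`, the learned identity tag, and all tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class IdentityTemporalConcat(nn.Module):
    """Prepend the reference-image latent as an extra 'frame' so the
    video backbone can attend to identity features at every timestep."""

    def __init__(self, latent_dim: int):
        super().__init__()
        # Learned tag distinguishing the identity frame from real video
        # frames (an assumed detail, not stated in the abstract).
        self.id_tag = nn.Parameter(torch.zeros(1, 1, latent_dim))

    def forward(self, video_latents: torch.Tensor,
                image_latent: torch.Tensor) -> torch.Tensor:
        # video_latents: (B, T, D) per-frame latent tokens
        # image_latent:  (B, 1, D) latent of the reference identity image
        tagged = image_latent + self.id_tag
        # Temporal concatenation: the backbone now sees T + 1 "frames",
        # the first carrying the subject identity.
        return torch.cat([tagged, video_latents], dim=1)
```

Concatenating along the temporal axis (rather than, say, the channel axis) lets every self-attention layer attend to the identity tokens directly, which is one plausible reason the abstract reports stronger cross-frame ID consistency.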
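The two modality-specific condition paths can be sketched in the same hedged style. First, one reading of the AudioNet module's spatial cross-attention: per-frame audio features act as keys and values while spatial video tokens act as queries; the class name, shapes, and residual wiring are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAudioCrossAttention(nn.Module):
    """Frame-wise cross-attention in which video tokens query audio
    features; a hedged sketch of the AudioNet injection described above."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor,
                audio_feats: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T, N, D) - N spatial tokens per frame
        # audio_feats:  (B, T, M, D) - M audio tokens aligned to frame t
        B, T, N, D = video_tokens.shape
        q = video_tokens.reshape(B * T, N, D)
        kv = audio_feats.reshape(B * T, -1, D)
        out, _ = self.attn(self.norm(q), kv, kv)
        # Residual injection keeps the original video content intact.
        return (q + out).reshape(B, T, N, D)
```

Second, a plausible form of the patchify-based feature-alignment network for the video condition: the condition video is compressed by a video VAE, then patchified into tokens in the backbone's width. `PatchifyVideoAlign` and its patch size are hypothetical; a strided Conv3d is simply the standard patchify operation in video DiTs.

```python
import torch
import torch.nn as nn

class PatchifyVideoAlign(nn.Module):
    """Turn a latent-compressed condition video into backbone tokens."""

    def __init__(self, latent_ch: int, dim: int,
                 patch: tuple = (1, 2, 2)):
        super().__init__()
        # Strided 3D convolution = non-overlapping spatio-temporal patches.
        self.proj = nn.Conv3d(latent_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, cond_latents: torch.Tensor) -> torch.Tensor:
        # cond_latents: (B, C, T, H, W) output of the video VAE encoder
        x = self.proj(cond_latents)           # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)   # (B, T'*H'*W', D) tokens
```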